# QA Dataset Sampling Tool

A command-line tool for sampling QA datasets and generating answers with OpenAI's GPT models. It helps you build question-answering datasets from large-scale collections such as MS MARCO.

## Features

- **Smart Sampling**: Sample queries, documents, and relevance judgments from large datasets
- **Answer Generation**: Generate answers automatically with OpenAI's GPT models
- **Resume Support**: Continue an interrupted answer-generation run from where it left off
- **Progress Tracking**: Real-time progress updates and statistics
- **Result Visualization**: Easy-to-read display of generated QA pairs with their context

## Installation

### Prerequisites

- Python 3.7+
- OpenAI API key

### Install Dependencies

```bash
pip install pandas pyarrow openai
```

### Set Environment Variables

```bash
export OPENAI_API_KEY="your-openai-api-key"
# Optional: Use custom OpenAI endpoint
export OPENAI_BASE_URL="https://api.openai.com/v1"
```

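
How `qa_dataset.py` reads these variables is not shown in this README. The sketch below is one plausible way to do it, using only the standard library: `load_openai_config` is a hypothetical helper, and the fallback URL is simply the value from the example above.

```python
import os

# Assumption: fall back to the endpoint shown in the README example.
DEFAULT_BASE_URL = "https://api.openai.com/v1"


def load_openai_config(env=None):
    """Return the API key and base URL found in an environment mapping."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        # Fail early with a clear message instead of at the first API call.
        raise RuntimeError("OPENAI_API_KEY is not set")
    base_url = env.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL
    return {"api_key": api_key, "base_url": base_url}


if __name__ == "__main__":
    # Pass a plain dict to try the helper without touching the real environment.
    print(load_openai_config({"OPENAI_API_KEY": "your-openai-api-key"}))
```

Checking the key up front makes a missing variable visible before any sampling or generation work starts.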
### Prepare Dataset

We provide pre-processed samples from popular QA datasets:

- `MarkrAI/msmarco_sample_autorag`

## Quick Start

### 1. Sample Data from a Large Dataset

First, sample a subset of queries, documents, and relevance judgments from your full dataset:

```bash
python dataset/qa_dataset.py sample \
  --queries ~/dataset/mmarco-queries.parquet \
  --corpus ~/dataset/mmarco-corpus.parquet \
  --qrels ~/dataset/mmarco-qrels.parquet \
  --nq 100 \
  --output_dir ./dataset/samples
```

### 2. Generate Answers

Use OpenAI's GPT model to generate answers for the sampled questions:

```bash
python dataset/qa_dataset.py generate \
  --input_dir ./dataset/samples \
  --output_dir ./dataset/samples
```

### 3. View Results

Display the generated QA pairs with their context:

```bash
python dataset/qa_dataset.py show \
  --input_dir ./dataset/samples \
  -n 5
```

## Detailed Usage

### Sample Command

Create a representative sample from your full dataset.

```bash
python dataset/qa_dataset.py sample [OPTIONS]
```

**Required Parameters:**
- `--queries`: Path to queries parquet file (columns: `id`, `text`)
- `--corpus`: Path to corpus parquet file (columns: `id`, `text`)
- `--qrels`: Path to qrels parquet file (columns: `qid`, `pid`)

**Optional Parameters:**
- `--nq`: Number of queries to sample (default: 1000)
- `--output_dir`: Output directory for sampled data (default: ./save)

**Example:**
```bash
python dataset/qa_dataset.py sample \
  --queries data/queries.parquet \
  --corpus data/corpus.parquet \
  --qrels data/qrels.parquet \
  --nq 500 \
  --output_dir ./my_sample
```

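
The sampling strategy itself is not documented here. As a rough illustration of what sampling `--nq` queries along with their relevance judgments and passages can look like, the sketch below works on in-memory records instead of parquet files. The function name `sample_dataset`, the `seed` argument, and the choice to sample only queries that have at least one judgment are assumptions, not the tool's confirmed behavior.

```python
import random


def sample_dataset(queries, corpus, qrels, nq, seed=0):
    """Pick up to nq judged queries and keep only the qrels and passages linked to them."""
    judged = {r["qid"] for r in qrels}
    candidates = [q for q in queries if q["id"] in judged]
    rng = random.Random(seed)  # fixed seed so repeated runs give the same sample
    picked = rng.sample(candidates, min(nq, len(candidates)))
    picked_ids = {q["id"] for q in picked}
    kept_qrels = [r for r in qrels if r["qid"] in picked_ids]
    kept_pids = {r["pid"] for r in kept_qrels}
    kept_corpus = [p for p in corpus if p["id"] in kept_pids]
    return picked, kept_corpus, kept_qrels


if __name__ == "__main__":
    queries = [{"id": f"q{i}", "text": f"question {i}"} for i in range(5)]
    corpus = [{"id": f"p{i}", "text": f"passage {i}"} for i in range(5)]
    qrels = [{"qid": f"q{i}", "pid": f"p{i}"} for i in range(5)]
    picked, kept_corpus, kept_qrels = sample_dataset(queries, corpus, qrels, nq=2)
    print(len(picked), len(kept_corpus), len(kept_qrels))
```

Filtering the qrels and corpus by the picked query IDs keeps the three output files consistent with each other, which is what the `generate` and `show` steps rely on.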
### Generate Command

Generate answers for sampled questions using the OpenAI API.

```bash
python dataset/qa_dataset.py generate [OPTIONS]
```

**Required Parameters:**
- `--input_dir`: Directory containing sampled data (queries.parquet, corpus.parquet, qrels.parquet)

**Optional Parameters:**
- `--output_dir`: Output directory for generated answers (default: ./save)

**Features:**
- **Resume Support**: Automatically continues from where it left off if interrupted
- **Error Handling**: Retries failed API calls up to 3 times
- **Progress Saving**: Saves progress after each successful answer generation

**Example:**
```bash
python dataset/qa_dataset.py generate \
  --input_dir ./my_sample \
  --output_dir ./my_sample
```

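
The three behaviors listed under Features (resume, retry, save after each success) fit together roughly as in the sketch below. It is not the tool's code: `generate_with_resume` and the `ask` callback are hypothetical, a JSON file stands in for the parquet outputs, and "up to 3 times" is read here as three attempts in total.

```python
import json
import os

MAX_ATTEMPTS = 3  # assumption: three attempts in total per question


def generate_with_resume(questions, ask, progress_path):
    """Answer each question once, saving progress after every success."""
    answers = {}
    if os.path.exists(progress_path):
        # Resume: load whatever an earlier, interrupted run already saved.
        with open(progress_path, encoding="utf-8") as f:
            answers = json.load(f)
    for qid, text in questions.items():
        if qid in answers:
            continue  # already answered in an earlier run
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                answers[qid] = ask(text)
                break
            except Exception as exc:
                print(f"attempt {attempt} failed for {qid}: {exc}")
        else:
            continue  # every attempt failed; leave this question for a later run
        # Save after each success so an interruption loses at most one answer.
        with open(progress_path, "w", encoding="utf-8") as f:
            json.dump(answers, f)
    return answers
```

In a real run, `ask` would wrap the OpenAI call. Questions that still fail after the last attempt are skipped without being recorded, so running the function again retries only those.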
### Show Command

Display generated QA pairs with full context.

```bash
python dataset/qa_dataset.py show [OPTIONS]
```

**Required Parameters:**
- `--input_dir`: Directory containing QA data (queries.parquet, corpus.parquet, qrels.parquet, qas.parquet, answers.parquet)

**Optional Parameters:**
- `-n`: Number of results to display (default: 5)

**Example:**
```bash
python dataset/qa_dataset.py show \
  --input_dir ./my_sample \
  -n 3
```

## Input Data Format

### Queries File (queries.parquet)
| Column | Type | Description |
|--------|------|-------------|
| id | string | Unique query identifier |
| text | string | The actual question text |

### Corpus File (corpus.parquet)
| Column | Type | Description |
|--------|------|-------------|
| id | string | Unique passage/document identifier |
| text | string | The passage/document content |

### Qrels File (qrels.parquet)
| Column | Type | Description |
|--------|------|-------------|
| qid | string | Query ID (matches queries.id) |
| pid | string | Passage ID (matches corpus.id) |

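
The tables above imply that every `qid` in the qrels should exist in the queries file and every `pid` in the corpus file. A minimal sketch of that consistency check, using plain dictionaries in place of parquet rows, could look like this; `find_dangling_qrels` is a hypothetical helper and not part of the tool.

```python
def find_dangling_qrels(queries, corpus, qrels):
    """Return the qrels rows whose qid or pid has no matching record."""
    query_ids = {q["id"] for q in queries}
    passage_ids = {p["id"] for p in corpus}
    return [
        r for r in qrels
        if r["qid"] not in query_ids or r["pid"] not in passage_ids
    ]


if __name__ == "__main__":
    queries = [{"id": "q1", "text": "what is ms marco"}]
    corpus = [{"id": "p1", "text": "MS MARCO is a collection of datasets."}]
    qrels = [{"qid": "q1", "pid": "p1"}, {"qid": "q1", "pid": "p9"}]
    # The second row points at a passage that is not in the corpus.
    print(find_dangling_qrels(queries, corpus, qrels))
```

Running a check like this on your own files before sampling helps surface ID mismatches, for example integer IDs in one file and string IDs in another.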
## Output Files

After running `sample` and `generate` with the same `--output_dir`, that directory will contain:

### Sampled Data
- `queries.parquet`: Sampled queries subset
- `corpus.parquet`: Sampled documents subset
- `qrels.parquet`: Sampled relevance judgments

### Generated Answers
- `answers.parquet`: Generated answers with unique IDs
- `qas.parquet`: Question-answer mapping (qid → aid)

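
To make the `qid → aid` mapping concrete, here is a small sketch of how the generated files can be joined back into question-answer pairs. The column names of `answers.parquet` are not listed in this README, so `id` and `text` are assumptions, as is the helper name `build_qa_pairs`. Rows are shown as dictionaries instead of parquet records.

```python
def build_qa_pairs(queries, answers, qas):
    """Join qas links (qid -> aid) with the query and answer texts."""
    question_by_id = {q["id"]: q["text"] for q in queries}
    # Assumption: answers use the same `id` / `text` layout as queries.
    answer_by_id = {a["id"]: a["text"] for a in answers}
    return [
        {
            "qid": link["qid"],
            "question": question_by_id[link["qid"]],
            "answer": answer_by_id[link["aid"]],
        }
        for link in qas
        if link["qid"] in question_by_id and link["aid"] in answer_by_id
    ]


if __name__ == "__main__":
    queries = [{"id": "q1", "text": "what is ms marco"}]
    answers = [{"id": "a1", "text": "A large-scale reading comprehension dataset."}]
    qas = [{"qid": "q1", "aid": "a1"}]
    print(build_qa_pairs(queries, answers, qas))
```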
## Advanced Usage

### Custom OpenAI Configuration

You can use different OpenAI models or endpoints:

```bash
# Use GPT-4 Turbo
export OPENAI_API_KEY="your-key"
python dataset/qa_dataset.py generate --input_dir ./samples

# Use Azure OpenAI
export OPENAI_API_KEY="azure-key"
export OPENAI_BASE_URL="https://your-resource.openai.azure.com/openai/deployments/gpt-4"
python dataset/qa_dataset.py generate --input_dir ./samples
```

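### Large Dataset Sampling

For very large datasets, consider sampling in batches. Pass `--output_dir` to `generate` so that each batch keeps its answers in its own directory instead of the default `./save`:

```bash
# First batch
python dataset/qa_dataset.py sample \
  --queries data/queries.parquet --corpus data/corpus.parquet --qrels data/qrels.parquet \
  --nq 1000 --output_dir ./batch1
python dataset/qa_dataset.py generate --input_dir ./batch1 --output_dir ./batch1

# Second batch
python dataset/qa_dataset.py sample \
  --queries data/queries.parquet --corpus data/corpus.parquet --qrels data/qrels.parquet \
  --nq 1000 --output_dir ./batch2
python dataset/qa_dataset.py generate --input_dir ./batch2 --output_dir ./batch2
```
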
## Troubleshooting

### Common Issues

**1. OpenAI API Errors**
- Ensure your API key is set correctly: `echo $OPENAI_API_KEY`
- Check your API quota and billing status
- Verify network connectivity to OpenAI

**2. Memory Issues with Large Datasets**
- Reduce `--nq` to draw a smaller sample
- Ensure sufficient RAM for pandas operations
- Consider using smaller parquet files

**3. File Not Found Errors**
- Verify all input file paths are correct
- Ensure parquet files have the correct column names
- Check file permissions

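
For the file-related errors above, a quick pre-flight check can tell you which expected files are missing before you run `generate` or `show`. The sketch below only tests for file existence and uses the file names listed in this README. `missing_files` is a hypothetical helper, and checking column names would additionally require reading the parquet files.

```python
from pathlib import Path

# File names taken from the `generate --input_dir` description in this README.
SAMPLED_FILES = ("queries.parquet", "corpus.parquet", "qrels.parquet")


def missing_files(input_dir, names=SAMPLED_FILES):
    """Return the expected file names that do not exist in input_dir."""
    root = Path(input_dir)
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    # An empty list means every expected file is present.
    print(missing_files("./my_sample"))
```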
### Debug Mode

To see more detail, add print statements or run the script under the Python debugger:

```bash
python -m pdb dataset/qa_dataset.py sample --queries ...
```

## Example Workflow

```bash
# 1. Setup environment
export OPENAI_API_KEY="sk-..."

# 2. Sample 200 queries from MS MARCO
python dataset/qa_dataset.py sample \
  --queries ~/mmarco/queries.parquet \
  --corpus ~/mmarco/corpus.parquet \
  --qrels ~/mmarco/qrels.parquet \
  --nq 200 \
  --output_dir ./marco_sample

# 3. Generate answers (may take time depending on API rate limits)
python dataset/qa_dataset.py generate \
  --input_dir ./marco_sample \
  --output_dir ./marco_sample

# 4. Review results
python dataset/qa_dataset.py show \
  --input_dir ./marco_sample \
  -n 10
```

## Contributing

Feel free to submit issues and enhancement requests!

## License

MIT License - feel free to use this tool for your research and projects.