l_ai_knowledge/dataset/README

# QA Dataset Sampling Tool

A comprehensive tool for sampling QA datasets and generating answers using OpenAI's GPT models. This tool helps you create high-quality question-answering datasets from large-scale collections like MS MARCO.

## Features

- **Smart Sampling**: Intelligently sample queries, documents, and relevance judgments from large datasets
- **Answer Generation**: Automatically generate high-quality answers using OpenAI's GPT models
- **Resume Support**: Continue interrupted answer generation from where it left off
- **Progress Tracking**: Real-time progress updates and statistics
- **Result Visualization**: Easy-to-read display of generated QA pairs with context

## Installation

### Prerequisites

- Python 3.7+
- OpenAI API key

### Install Dependencies

```bash
pip install pandas pyarrow openai
```

### Set Environment Variables

```bash
export OPENAI_API_KEY="your-openai-api-key"
# Optional: Use custom OpenAI endpoint
export OPENAI_BASE_URL="https://api.openai.com/v1"
```

### Parpare dataset

We provide pre-processed samples from popular QA datasets:

MarkrAI/msmarco_sample_autorag

## Quick Start

### 1. Sample Data from Large Dataset

First, sample a subset of queries, documents, and relevance judgments from your full dataset:

```bash
python dataset/qa_dataset.py sample \
  --queries ~/dataset/mmarco-queries.parquet \
  --corpus ~/dataset/mmarco-corpus.parquet \
  --qrels ~/dataset/mmarco-qrels.parquet \
  --nq 100 \
  --output_dir ./dataset/samples
```

### 2. Generate Answers

Use OpenAI's GPT model to generate answers for the sampled questions:

```bash
python dataset/qa_dataset.py generate \
  --input_dir ./dataset/samples \
  --output_dir ./dataset/samples
```

### 3. View Results

Display the generated QA pairs with their context:

```bash
python dataset/qa_dataset.py show \
  --input_dir ./dataset/samples \
  -n 5
```

## Detailed Usage

### Sample Command

Create a representative sample from your full dataset.

```bash
python dataset/qa_dataset.py sample [OPTIONS]
```

**Required Parameters:**
- `--queries`: Path to queries parquet file (columns: `id`, `text`)
- `--corpus`: Path to corpus parquet file (columns: `id`, `text`)
- `--qrels`: Path to qrels parquet file (columns: `qid`, `pid`)

**Optional Parameters:**
- `--nq`: Number of queries to sample (default: 1000)
- `--output_dir`: Output directory for sampled data (default: ./save)

**Example:**
```bash
python dataset/qa_dataset.py sample \
  --queries data/queries.parquet \
  --corpus data/corpus.parquet \
  --qrels data/qrels.parquet \
  --nq 500 \
  --output_dir ./my_sample
```

### Generate Command

Generate answers for sampled questions using OpenAI API.

```bash
python dataset/qa_dataset.py generate [OPTIONS]
```

**Required Parameters:**
- `--input_dir`: Directory containing sampled data (queries.parquet, corpus.parquet, qrels.parquet)

**Optional Parameters:**
- `--output_dir`: Output directory for generated answers (default: ./save)

**Features:**
- **Resume Support**: Automatically continues from where it left off if interrupted
- **Error Handling**: Retries failed API calls up to 3 times
- **Progress Saving**: Saves progress after each successful answer generation

**Example:**
```bash
python dataset/qa_dataset.py generate \
  --input_dir ./my_sample \
  --output_dir ./my_sample
```

### Show Command

Display generated QA pairs with full context.

```bash
python dataset/qa_dataset.py show [OPTIONS]
```

**Required Parameters:**
- `--input_dir`: Directory containing QA data (queries.parquet, corpus.parquet, qrels.parquet, qas.parquet, answers.parquet)

**Optional Parameters:**
- `-n`: Number of results to display (default: 5)

**Example:**
```bash
python dataset/qa_dataset.py show \
  --input_dir ./my_sample \
  -n 3
```

## Input Data Format

### Queries File (queries.parquet)
| Column | Type | Description |
|--------|------|-------------|
| id | string | Unique query identifier |
| text | string | The actual question text |

### Corpus File (corpus.parquet)
| Column | Type | Description |
|--------|------|-------------|
| id | string | Unique passage/document identifier |
| text | string | The passage/document content |

### Qrels File (qrels.parquet)
| Column | Type | Description |
|--------|------|-------------|
| qid | string | Query ID (matches queries.id) |
| pid | string | Passage ID (matches corpus.id) |

## Output Files

After running all commands, your output directory will contain:

### Sampled Data
- `queries.parquet`: Sampled queries subset
- `corpus.parquet`: Sampled documents subset
- `qrels.parquet`: Sampled relevance judgments

### Generated Answers
- `answers.parquet`: Generated answers with unique IDs
- `qas.parquet`: Question-answer mapping (qid → aid)

## Advanced Usage

### Custom OpenAI Configuration

You can use different OpenAI models or endpoints:

```bash
# Use GPT-4 Turbo
export OPENAI_API_KEY="your-key"
python dataset/qa_dataset.py generate --input_dir ./samples

# Use Azure OpenAI
export OPENAI_API_KEY="azure-key"
export OPENAI_BASE_URL="https://your-resource.openai.azure.com/openai/deployments/gpt-4"
python dataset/qa_dataset.py generate --input_dir ./samples
```

### Large Dataset Sampling

For very large datasets, consider sampling in batches:

```bash
# First batch
python dataset/qa_dataset.py sample --nq 1000 --output_dir ./batch1
python dataset/qa_dataset.py generate --input_dir ./batch1

# Second batch
python dataset/qa_dataset.py sample --nq 1000 --output_dir ./batch2
python dataset/qa_dataset.py generate --input_dir ./batch2
```

## Troubleshooting

### Common Issues

**1. OpenAI API Errors**
- Ensure your API key is set correctly: `echo $OPENAI_API_KEY`
- Check your API quota and billing status
- Verify network connectivity to OpenAI

**2. Memory Issues with Large Datasets**
- Reduce `--nq` parameter for smaller samples
- Ensure sufficient RAM for pandas operations
- Consider using smaller parquet files

**3. File Not Found Errors**
- Verify all input file paths are correct
- Ensure parquet files have correct column names
- Check file permissions

### Debug Mode

Enable verbose output by adding print statements or using Python debugger:

```bash
python -m pdb dataset/qa_dataset.py sample --queries ...
```

## Example Workflow

```bash
# 1. Setup environment
export OPENAI_API_KEY="sk-..."

# 2. Sample 200 queries from MS MARCO
python dataset/qa_dataset.py sample \
  --queries ~/mmarco/queries.parquet \
  --corpus ~/mmarco/corpus.parquet \
  --qrels ~/mmarco/qrels.parquet \
  --nq 200 \
  --output_dir ./marco_sample

# 3. Generate answers (may take time depending on API rate limits)
python dataset/qa_dataset.py generate \
  --input_dir ./marco_sample \
  --output_dir ./marco_sample

# 4. Review results
python dataset/qa_dataset.py show \
  --input_dir ./marco_sample \
  -n 10
```

## Contributing

Feel free to submit issues and enhancement requests!

## License

MIT License - feel free to use this tool for your research and projects.