Metadata-Version: 2.4
Name: pdbe-mmcif-validator
Version: 0.1.91
Summary: Validate mmCIF/CIF files against the PDBx/mmCIF dictionary or any CIF dictionary
Author-email: Deborah Harrus <dharrus@ebi.ac.uk>
Maintainer: Protein Data Bank in Europe (PDBe), EMBL-EBI
License: MIT
Project-URL: Homepage, https://github.com/PDBeurope/mmcif-validator
Project-URL: Documentation, https://github.com/PDBeurope/mmcif-validator/blob/main/vscode-extension/python-script/README.md
Project-URL: Repository, https://github.com/PDBeurope/mmcif-validator
Keywords: mmcif,cif,pdb,validation,pdbe,structural-biology
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Chemistry
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Provides-Extra: dev

# PDBe mmCIF Validator - Python Script

**Version 0.1.91**

A standalone Python script to validate mmCIF/CIF files against the PDBx/mmCIF dictionary or any CIF dictionary.

**0.1.91** — JSON-first procedural cross-checks (`cross_checks_procedural_validators.json`), pairwise date ordering (`cross_checks_pairwise_date_order.json`), within-category uniqueness (`cross_checks_uniqueness.json`), preservation of quoted empty loop values (`''` / `""`), dictionary-enum / subtype / cross-reference selector coverage, and centralized message-template rendering. Details: [extension CHANGELOG](../CHANGELOG.md).

**0.1.9** — Added a scalable cross-check architecture: reusable rule utilities, grouped JSON rule families, rule engine/registry, configurable rule-group toggles, expanded imported cross-check runtime coverage, and harmonized error-message formatting (including compared-value insertion).

**0.1.81** improved real-world dictionary and file handling: loops followed by key–value pairs of the same category, stricter `asym_id` checks, loop-style item definitions and `_pdbx_item_enumeration`, and deterministic ordering for regression comparisons.

## Features

- ✅ Validates mmCIF/CIF files against any CIF dictionary schema
- ✅ Checks for missing mandatory items (only for categories present in file)
- ✅ Validates item values against enumerations
- ✅ **Data type validation** - Automatically validates types with regex patterns from dictionary (email, phone, orcid_id, pdb_id, fax, etc.) plus hardcoded validations for dates, integers, floats, booleans
- ✅ **Range validation** - Distinguishes between strictly allowed boundary conditions (errors) and advisory boundary conditions (warnings)
- ✅ **Parent/child category relationship validation** - Ensures parent categories exist when child categories are present
- ✅ **Foreign key integrity validation** - Ensures referenced data exists in parent items
- ✅ **Composite key validation** - Validates that combinations of multiple child items together match corresponding combinations in parent categories
- ✅ **Operation expression validation** - Parses and validates complex operation expressions like `(1-60)`, `(1,2,5)`, `(X0)(1-5,11-15)`
- ✅ **Duplicate category and item detection** - Reports when a category or item is duplicated (in loop or frame format)
- ✅ Supports local dictionary files or downloading from URL (works with PDBx/mmCIF dictionary or any CIF dictionary format)
- ✅ **Enhanced JSON output** - Includes precise character positions and column indices for programmatic error handling
- ✅ **Exit codes** - Returns 0 for success, 1 for errors (useful for CI/CD integration)
- ✅ **Metadata completeness in JSON** - When validation runs, the JSON output includes an optional `metadata_completeness` object (percentage, filled/total counts, detected method, missing categories, missing items with row/key and validation-error flag)

## Installation

### Prerequisites

- Python 3.7 or higher (uses only Python standard library, no pip packages required)
- Internet connection (optional) - Only needed if downloading dictionary from URL. Can use local dictionary file for offline use.
- CIF dictionary file (optional) - Defaults to PDBx/mmCIF dictionary from URL, but can use any CIF dictionary format

## Usage

### Basic Usage

```bash
# Dictionary source can be a file path or URL (auto-detected)
# Works with PDBx/mmCIF dictionary or any CIF dictionary format
python validate_mmcif.py <dictionary.dic or URL> <mmcif_file.cif>
```

### Using Local Dictionary File

```bash
# Use PDBx/mmCIF dictionary
python validate_mmcif.py mmcif_pdbx_v5_next.dic 6qvt.cif

# Or use any CIF dictionary file
python validate_mmcif.py path/to/your/cif_dictionary.dic your_file.cif
```

### Using Dictionary from URL

```bash
# Using --url option (explicit) - defaults to PDBx/mmCIF dictionary
python validate_mmcif.py --url http://mmcif.pdb.org/dictionaries/ascii/mmcif_pdbx.dic 6qvt.cif

# Or use any CIF dictionary URL
python validate_mmcif.py --url https://example.com/path/to/your/dictionary.dic 6qvt.cif

# Or as positional argument (auto-detects URL)
python validate_mmcif.py http://mmcif.pdb.org/dictionaries/ascii/mmcif_pdbx.dic 6qvt.cif
```

### Explicit Options

```bash
# Use local file
python validate_mmcif.py --file mmcif_pdbx_v5_next.dic 6qvt.cif

# Use URL
python validate_mmcif.py --url http://mmcif.pdb.org/dictionaries/ascii/mmcif_pdbx.dic 6qvt.cif
```

### Help

```bash
python validate_mmcif.py --help
```

## Library usage

You can use the validator as a Python library (e.g. in prerelease pipelines or other tools) by importing and calling the same logic the CLI uses. The library raises exceptions instead of exiting, so callers can handle errors.

### Install

From the project (e.g. after cloning or from a wheel):

```bash
pip install -e /path/to/mmcif-validator/vscode-extension/python-script
# or from that directory:
pip install -e .
```

Or install from PyPI (when published):

```bash
pip install pdbe-mmcif-validator
```

When installed via pip, a `validate-mmcif` console script is also available:

```bash
validate-mmcif --file mmcif_pdbx_v5_next.dic file.cif
validate-mmcif --help
```

### Basic usage

```python
from pathlib import Path
from validate_mmcif import validate, ValidatorFactory, ValidationError
from validate_mmcif import DictionaryNotFoundError, CifNotFoundError, DownloadError

# Option 1: top-level function (recommended)
try:
    errors = validate(Path("mmcif_pdbx_v5_next.dic"), Path("file.cif"))
    for err in errors:
        print(err.line, err.item, err.message, err.severity)
    if not errors:
        print("Validation passed.")
except DictionaryNotFoundError as e:
    print("Dictionary not found:", e)
except CifNotFoundError as e:
    print("mmCIF file not found:", e)
except DownloadError as e:
    print("Download failed:", e)

# Option 2: factory (same behaviour)
errors = ValidatorFactory.validate(Path("dict.dic"), Path("file.cif"))
```

### Using a dictionary from a URL

Download the dictionary first, then validate:

```python
from pathlib import Path
from validate_mmcif import validate, download_dictionary, DownloadError

try:
    dict_path = download_dictionary("http://mmcif.pdb.org/dictionaries/ascii/mmcif_pdbx.dic")
    errors = validate(dict_path, Path("file.cif"))
    # ... use errors ...
finally:
    if dict_path.exists():
        dict_path.unlink()  # clean up temp file
except DownloadError as e:
    print("Download failed:", e)
```

### Integrating with your logging

The module uses the standard `logging` logger `validate_mmcif`. Configure logging so library messages go to your logs:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
# or attach to your app's logger:
logging.getLogger("validate_mmcif").setLevel(logging.INFO)
```

### Exceptions

| Exception | When it is raised |
|-----------|-------------------|
| `DictionaryNotFoundError` | The dictionary path does not exist. |
| `CifNotFoundError` | The mmCIF file path does not exist. |
| `DownloadError` | Downloading the dictionary from a URL failed. |

All of these inherit from `MmCIFValidatorError`, so you can catch that for any validator error.

### Return value

`validate()` and `ValidatorFactory.validate()` return a list of `ValidationError` dataclass instances with:

- `line`, `item`, `message`, `severity` (`"error"` or `"warning"`)
- `column`, `start_char`, `end_char` (optional, for positioning)

## Output

The script outputs:
- Validation errors and warnings with line numbers
- JSON output for programmatic use (includes optional `metadata_completeness` when validation runs). The `metadata_completeness` object combines method-specific mandatory categories from the `completeness/` lists with deposition-mandatory items from the dictionary, and includes special handling for certain category groups (for example, entity-source categories from `entity_src_cat.list` are treated as satisfied when at least one of them is present).
- Exit code 0 for success, 1 for errors

Example output:
```
Parsing dictionary: mmcif_pdbx.dic
Loaded 6652 items from dictionary
Parsing mmCIF file: model.cif
Found 1124 items in mmCIF file
Validating...

Found 4 validation issue(s):

ERROR: Line 36, Item '_pdbx_database_status.recvd_initial_deposition_date'
  Value '20250601' does not match expected type 'yyyy-mm-dd'

ERROR: Line 1643, Item '_pdbx_struct_assembly_gen.oper_expression'
  Operation expression '1' references operation ID '1' which does not exist in '_pdbx_struct_oper_list.id'

WARNING: Line 1011, Item '_refine.ls_R_factor_obs'
  Out of advisory range: Value '0.350' is above advisory maximum '0.300'

ERROR: Line 1020, Item '_refine.ls_R_factor_obs'
  Value '1.250' is above maximum allowed value '1.000'
```

### JSON Output Format

The script outputs JSON at the end with the following structure:

```json
{
  "errors": [
    {
      "line": 1238,
      "item": "_refine_ls_shell.number_reflns_R_free",
      "message": "Out of advisory range: Value '0' is below minimum advised value '1'",
      "severity": "warning",
      "column": 5,
      "start_char": 43,
      "end_char": 44
    }
  ]
}
```

**Fields:**
- `line`: Line number (1-based) where the error occurs
- `item`: The item name (e.g., `_refine_ls_shell.number_reflns_R_free`)
- `message`: Human-readable error message
- `severity`: Either `"error"` or `"warning"`
- `column`: Global column index (0-based) within the row (for loop data) or `null` for non-loop items
- `start_char`: Character start position (0-based) within the line for precise highlighting, or `null` if not available
- `end_char`: Character end position (0-based) within the line for precise highlighting, or `null` if not available

The output may also include a **`metadata_completeness`** object (used by the VSCode extension): `percentage` (0–100), `filled_count`, `total_count`, `method_detected` (xray/em/nmr or null), `message` (e.g. when method is unknown), `missing_categories` (list of category names), and `missing_items` (list of `{ category, item, row_index?, row_key?, has_validation_error? }`). Items with a validation error are counted as not filled and have `has_validation_error: true`.

The `start_char` and `end_char` fields enable precise highlighting of the exact problematic value, even when the same value appears multiple times on a line.

## Validation Checks

The validator performs the following checks:

1. **Item Definition**: Verifies that items used in the mmCIF file are defined in the dictionary
2. **Mandatory Items**: Checks that all mandatory items are present (only for categories that exist in the file)
3. **Enumeration Values**: Validates that item values match allowed enumerations (reported as errors)
   - Handles enumerations with only `_item_enumeration.value` (no detail field)
   - Handles enumerations with both `value` and `detail` fields
4. **Data Type Validation**: Validates that values match their expected data types:
   - **Regex patterns from dictionary** - Automatically validates any type code that has a regex pattern defined in `_item_type_list.construct` (e.g., `email`, `phone`, `orcid_id`, `pdb_id`, `fax`, etc.)
   - **Hardcoded validations** for common types:
     - Date formats: `yyyy-mm-dd`, `yyyy-mm-dd:hh:mm`, `yyyy-mm-dd:hh:mm-flex`
     - Numeric types: `int`, `positive_int`, `float`, `float-range`
     - Boolean type: `boolean`
5. **Range Validation**: Checks that numeric values fall within specified minimum/maximum ranges
   - **Strictly Allowed Boundary Conditions** (`_item_range`): Violations are reported as **errors**
   - **Advisory Boundary Conditions** (`_pdbx_item_range`): Violations are reported as **warnings** with "Out of advisory range:" prefix
6. **Parent/Child Category Validation**: 
   - Verifies that when a child category is present, its parent category is also present
   - Example: If `entity_src_nat` (child) is present, `entity` (parent) must also be present
7. **Foreign Key Integrity**: Validates that foreign key values in child items exist in their parent items
   - Example: `_entity_src_nat.entity_id` values must exist in `_entity.id`
8. **Composite Key Validation**: Validates that combinations of multiple child items together match corresponding combinations in parent categories
   - Example: In `pdbx_entity_poly_domain`, the combination of `begin_mon_id` + `begin_seq_num` must match a row in `entity_poly_seq` where `mon_id` + `num` appear together as a pair
   - Validates relationships where multiple items form a composite foreign key (identified by `link_group_id` in the dictionary)
   - **Special handling for label/auth field combinations**: Categories like `struct_conn`, `pdbx_struct_conn_angle`, `geom_*`, `atom_site_anisotrop`, and others have composite keys that include both label and auth fields. The validator intelligently handles these by:
     - First attempting validation using label fields (if complete)
     - Falling back to auth fields when label fields are incomplete (e.g., when `label_seq_id` is missing)
     - Using `label_atom_id` when `auth_atom_id` is not present in the file
     - This ensures atoms referenced in these categories are properly validated against `atom_site` even when some fields are missing
9. **Operation Expression Validation**: Validates `oper_expression` values that reference operation IDs
   - Parses complex operation expressions: `(1)`, `(1,2,5)`, `(1-4)`, `(1,2)(3,4)`, `(X0)(1-5,11-15)`
   - Validates that all referenced operation IDs exist in `_pdbx_struct_oper_list.id`
   - Example: If `oper_expression` is `(1-60)`, validates that operation IDs 1 through 60 all exist
10. **Category-aware validation**: Only checks mandatory items for categories that are actually present in the mmCIF file
11. **First data block only**: By default, only validates the first data block in files containing multiple data blocks (each starting with `data_`)

## Adding New Rule Groups

Additional cross-check rule groups are managed by the rule engine and registry under `rules/`.

### Steps

1. Create a new rule-group class in `rules/` (JSON-backed grouped rule files are preferred for large imported families).
2. Add it to `RULE_GROUP_REGISTRY` in `rules/engine.py` with a unique ID.
3. Toggle it in `rules/rule_groups.json` as needed.

### Example

If your new rule group ID is `geometry_consistency`, your config can look like:

```json
{
  "enabled_rule_groups": [
    "imported_cross_checks",
    "geometry_consistency"
  ],
  "disabled_rule_groups": []
}
```

If `enabled_rule_groups` is omitted, all registered rule groups are enabled by default.

### Grouped JSON rule format

Imported cross-check families are stored as grouped JSON files under `rules/data/` using a shared prefix:

- `cross_checks_pairwise_comparison.json`
- `cross_checks_pairwise_date_order.json` — same-category date/datetime ordering (chronological constraints, currently `pdbx_database_status`)
- `cross_checks_uniqueness.json` — duplicate detection for configured key columns within one category (e.g. `entity.id`)
- `cross_checks_linked_presence_and_comparison.json`
- `cross_checks_conditional_required.json` — rules may include **`skip_if_any_category_present`** (list of category names): when any listed category has at least one row, the rule is skipped (e.g. skip legacy `refine.pdbx_starting_model` when `pdbx_initial_refinement_model` is present).
- `cross_checks_conditional_regex.json`
- `cross_checks_conditional_enumeration.json`
- `cross_checks_conditional_category_item.json`
- `cross_checks_required_if_any_present.json`
- `cross_checks_cross_reference_full.json`
- `cross_checks_procedural_validators.json` — procedural checks (data-driven `kind` values: wavelength vs protocol including empty list, accession format rules, conditional accession/source rules on `pdbx_initial_refinement_model`, sequence predicate warnings on `entity_poly.pdbx_seq_one_letter_code`). See extension **CHANGELOG** `[0.1.91]` for the full list.

Keeping related checks together in a consistent JSON family format makes future re-imports and diff reviews simpler.

### Maintaining grouped cross-check data

Add or change rules by editing the JSON files under `rules/data/`. Each file is loaded by **`ImportedCrossChecksRuleGroup`** in `rules/imported_cross_checks.py` (or the shared rule engine where registered). Prefer small, reviewable JSON diffs and matching runtime tests under `testing/cif_files/`.

### Roadmap (cross-check families)

- **Date/time (in progress)**: `cross_checks_pairwise_date_order.json` covers same-row ordering on `_pdbx_database_status`. Next steps are more rows in that file (other date pairs) or additional categories such as `_database_PDB_rev` and `_audit_*` once row alignment rules are clear.
- **Uniqueness**: first slice shipped as `cross_checks_uniqueness.json` (`entity`, `struct_asym`, `entity_poly`). Extend with `key_items` lists for composite keys (for example several `atom_site` columns) where file-level uniqueness is required.
- **Taxonomy / source**: conditional or linked rules across `entity` / `entity_src_*` without external NCBI calls, unless you later bundle a taxonomy table.

## Error vs Warning Severity

The validator reports issues with different severity levels:

### Errors (Red Underline)
These are violations of mandatory constraints that must be fixed:

- **Missing Mandatory Items**: Required items that are missing from categories present in the file
- **Enumeration Violations**: Values that don't match the controlled vocabulary/enumeration list
- **Data Type Mismatches**: Values that don't match their expected data type (e.g., invalid date format, non-numeric value for integer type)
- **Strictly Allowed Range Violations** (`_item_range`): Values outside the strictly allowed boundary conditions
- **Parent Category Missing**: Child categories present but their required parent categories are missing
- **Foreign Key Integrity Violations**: Foreign key values that don't exist in their parent items
- **Composite Key Violations**: Combinations of multiple child items that don't match corresponding combinations in parent categories (including label/auth field combinations)
- **Invalid Operation Expression References**: Operation expressions referencing operation IDs that don't exist

### Warnings (Yellow Underline)
These are advisory issues that may indicate problems but are not strictly required:

- **Undefined Items**: Items used in the mmCIF file that are not defined in the dictionary (only for items not starting with `_`)
- **Advisory Range Violations** (`_pdbx_item_range`): Values outside the advisory boundary conditions (but within allowed range)

## Command-Line Options

- `--file, -f`: Path to local dictionary file (.dic)
- `--url, -u`: URL to download dictionary from
- Positional arguments: Dictionary source (auto-detects file path or URL) and mmCIF file

## Examples

```bash
# Validate with local dictionary
python validate_mmcif.py mmcif_pdbx_v5_next.dic 6qvt.cif

# Validate with online dictionary
python validate_mmcif.py --url http://mmcif.pdb.org/dictionaries/ascii/mmcif_pdbx.dic 6qvt.cif

# Output to file
python validate_mmcif.py mmcif_pdbx_v5_next.dic 6qvt.cif > validation_results.txt
```

## Troubleshooting

### Dictionary file not found
- Check that the file path is correct
- Use absolute paths if relative paths don't work
- Or use the `--url` option to download from the internet

### Validation script errors
- Ensure Python 3.7+ is installed: `python --version`
- Check file paths are correct
- Verify dictionary file format is correct
- For large files, validation may take time. When using the VSCode extension, the validation timeout is configurable in settings (default 60 seconds, max 600); increase `mmcifValidator.validationTimeoutSeconds` if you see "Validation timed out".

### Python not found
- Make sure Python is in your PATH
- Or use the full path to your Python executable, e.g. `python3 validate_mmcif.py ...` or on Windows `C:\Python39\python.exe validate_mmcif.py ...`

## Limitations

- Dictionary parsing is simplified and may not handle all dictionary features
- Large dictionary files may take time to parse
- Some advanced validation rules may not be implemented yet
- **Note**: Some "missing mandatory item" errors may be false positives. In the mmCIF dictionary, items are often mandatory only when their parent category is present. The current implementation checks mandatory items only for categories that exist in the file, which should reduce false positives.
- **Note**: Some foreign key validation errors may be false positives if relationships are optional or conditional. The validator checks all defined parent/child relationships from `_pdbx_item_linked_group_list`.
- **Note**: For categories with both label and auth fields (like `struct_conn`), the validator will attempt to validate using label fields first, then fall back to auth fields if label fields are incomplete. This ensures proper validation even when some fields are missing (e.g., when `label_seq_id` is "." for non-polymer entities).
- Data type validation uses regex patterns from the dictionary when available. Types like `email`, `phone`, `orcid_id`, `pdb_id`, etc. are automatically validated if they have regex patterns defined in `_item_type_list.construct`. Types without regex patterns fall back to hardcoded validation (dates, int, positive_int, float, float-range, boolean) or are accepted without format validation.

## Implemented Features

The validator currently implements comprehensive validation including:

- [x] **Data type validation** - Validates int, float, date, etc., plus automatic validation of any type with regex pattern in dictionary (email, phone, orcid_id, pdb_id, fax, etc.)
- [x] **Range validation** - Checks min/max values from dictionary constraints
- [x] **Parent/child category validation** - Validates category hierarchies
- [x] **Foreign key integrity validation** - Ensures referenced data exists
- [x] **Composite key validation** - Validates that combinations of multiple child items match corresponding combinations in parent categories
- [x] **Label/auth field composite key handling** - Special validation for categories with both label and auth fields (struct_conn, geom_*, etc.) that intelligently falls back between label and auth fields when some values are missing
- [x] **Operation expression validation** - Parses and validates complex `oper_expression` values
- [x] **Regular expression validation** - Automatically extracts and uses regex patterns from dictionary for type validation
- [x] **Conditional relationship validation** - Entity type-based validation for polymer/non-polymer relationships
- [x] **Loop structure parsing** - Parses and validates loop structures in mmCIF files
- [x] **Category key extraction** - Extracts and uses category keys from dictionary definitions

## License

MIT

## Author

Deborah Harrus, Protein Data Bank in Europe (PDBe), EMBL-EBI

## Related

This script is part of the [PDBe mmCIF Validator](https://github.com/PDBeurope/mmcif-validator) project, which also includes a Visual Studio Code extension for real-time validation.

