# Lungs Dataset

An anonymized lung cancer imaging dataset containing CT DICOM scans (plus PET scans for a subset of patients) alongside clinical patient metadata.

## Dataset Summary

| Attribute | Value |
|-----------|-------|
| Total patients (per patient sheet) | 457 |
| Total patients (with imaging folders) | 452 |
| Male / Female | 285 / 171 |
| Age range | 38 – 104 years (mean 68.5) |
| PET-positive cases | 103 confirmed, 5 unclear |

## Directory Structure

```
lungs_dataset/
├── PatientsTableAnonymized.xlsx   # Clinical metadata summary
└── img/
    ├── <patient_id>/              # Individual patient folders (numeric IDs)
    │   └── S0001/
    │       └── IMG-XXXX-XXXXX.dcm
    ├── LUNG_DATASET/              # Patients with paired CT + PET scans
    │   └── <patient_id>/
    │       ├── CT_1_25/
    │       │   └── S0001/
    │       │       └── IMG-XXXX-XXXXX.dcm
    │       └── PET/
    │           └── S0001/
    │               └── IMG-XXXX-XXXXX.dcm
    └── instead of venous and 5/  # Additional patient cohort (32 patients)
        └── <patient_id>/
            └── S0001/
                └── IMG-XXXX-XXXXX.dcm
```

- Patient folders: 320 top-level + 100 in `LUNG_DATASET` + 32 in `instead of venous and 5` = **452 total**
- 328 of the 457 patients in the sheet have a linked folder ID; 129 have no folder ID assigned
- DICOM slices per scan: 0 – 1,995 (average ~204)
- All images are in standard DICOM format (`.dcm`)
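
The folder counts above can be re-checked against a local copy with a short script. The following is a minimal sketch using only the Python standard library. The function name `count_patient_folders` is hypothetical, and the sketch assumes that every directory directly under `img/` other than the two cohort folders is a single patient folder.

```python
from pathlib import Path

# Cohort subfolders under img/, named as in the directory tree above.
COHORT_DIRS = ("LUNG_DATASET", "instead of venous and 5")


def count_patient_folders(img_root):
    """Return the number of patient folders per cohort under img_root.

    Assumption: any directory directly under img_root that is not one of
    COHORT_DIRS is a top-level patient folder.
    """
    img_root = Path(img_root)
    counts = {"top-level": 0}
    for entry in img_root.iterdir():
        if not entry.is_dir():
            continue  # skip stray files
        if entry.name in COHORT_DIRS:
            # Each direct subfolder of a cohort directory is one patient.
            counts[entry.name] = sum(1 for p in entry.iterdir() if p.is_dir())
        else:
            counts["top-level"] += 1
    return counts
```

If the local copy matches the layout described here, `count_patient_folders("lungs_dataset/img")` should report 320, 100, and 32 folders for the three groups, and `sum(counts.values())` should give 452.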

## Cancer Types

| Type | Count |
|------|-------|
| Adenocarcinoma (all subtypes) | 345 |
| Squamous cell carcinoma | 40 |
| Non-keratinizing squamous cell carcinoma | 29 |
| NSCLC-NOS | 16 |
| Non-NSCLC / unclear | 10 |
| Large cell carcinoma | 1 |
| Adenosquamous carcinoma | 1 |

## Biomarkers

| Biomarker | Count |
|-----------|-------|
| TTF1 | 308 |
| CK7 | 242 |
| NAPSIN | 113 |
| NAPSIN A | 89 |
| P63 | 83 |
| CK5 | 48 |
| P40 | 44 |
| CK AE1/AE3 | 30 |
| KI67 | 21 |
| Not given | 18 |

## Notes

- All patient identifiers have been anonymized.
- The `LUNG_DATASET` subfolder contains paired PET/CT scans for a subset of patients, useful for multimodal analysis.
- The `instead of venous and 5` subfolder contains 32 additional patients with CT scans only.
- Some sessions contain zero slices (empty folders), which accounts for the minimum of 0 in the slices-per-scan range.
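
Before training or analysis, it can help to flag empty sessions and to confirm which `LUNG_DATASET` patients actually have both modalities on disk. The sketch below uses only the Python standard library. The helper names (`is_session_dir`, `empty_sessions`, `paired_patients`) are hypothetical, and it assumes that session folders are named `S` followed by digits (as in `S0001`) and that slices end in lowercase `.dcm`.

```python
from pathlib import Path


def is_session_dir(path):
    """True for directories named like S0001 (assumed session naming)."""
    return path.is_dir() and path.name.startswith("S") and path.name[1:].isdigit()


def empty_sessions(img_root):
    """Return session folders under img_root that contain no .dcm files."""
    return sorted(
        p for p in Path(img_root).rglob("S*")
        if is_session_dir(p) and not any(p.glob("*.dcm"))
    )


def has_slices(folder):
    """True if folder exists and holds at least one .dcm file at any depth."""
    return folder.is_dir() and any(folder.rglob("*.dcm"))


def paired_patients(img_root):
    """Return LUNG_DATASET patient IDs with slices in both CT_1_25 and PET."""
    base = Path(img_root) / "LUNG_DATASET"
    if not base.is_dir():
        return []
    return [
        patient.name
        for patient in sorted(base.iterdir())
        if patient.is_dir()
        and has_slices(patient / "CT_1_25")
        and has_slices(patient / "PET")
    ]
```

Both functions take the path to `img/`, for example `empty_sessions("lungs_dataset/img")`. Patients that appear in `LUNG_DATASET` but are missing from the `paired_patients` result are worth inspecting by hand before using them for multimodal work.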
