Lungs Dataset

An anonymized lung cancer imaging dataset containing CT DICOM scans (with paired PET scans for a subset of patients) alongside clinical patient metadata.

Dataset Summary

| Attribute | Value |
| --- | --- |
| Total patients (per patient sheet) | 457 |
| Total patients (with imaging folders) | 452 |
| Male / Female | 285 / 171 |
| Age range | 38 – 104 years (mean 68.5) |
| PET-positive cases | 103 confirmed, 5 unclear |
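
The patient sheet and the imaging folders do not line up one-to-one (457 rows versus 452 folders), so it can help to reconcile the two ID lists before analysis. The following is a minimal sketch, not part of the dataset tooling. The function name `reconcile_ids` is hypothetical, and because the column names of `PatientsTableAnonymized.xlsx` are not documented here, it takes plain iterables of IDs rather than reading the sheet itself.

```python
def reconcile_ids(sheet_ids, folder_ids):
    """Compare folder IDs listed in the sheet with folder names found on disk.

    Both inputs are iterables of IDs. Values are normalized to stripped
    strings; empty or missing sheet cells are skipped.
    """
    sheet = {str(i).strip() for i in sheet_ids if i is not None and str(i).strip()}
    disk = {str(i).strip() for i in folder_ids if str(i).strip()}
    return {
        "matched": sorted(sheet & disk),      # listed in the sheet and present on disk
        "sheet_only": sorted(sheet - disk),   # listed in the sheet, no folder found
        "disk_only": sorted(disk - sheet),    # folder on disk, not listed in the sheet
    }
```

If the sheet is loaded with a library that reads numeric IDs as floats, convert them to integers first, since `101.0` and `101` would otherwise be treated as different IDs.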

Directory Structure

lungs_dataset/
├── PatientsTableAnonymized.xlsx   # Clinical metadata summary
└── img/
    ├── <patient_id>/              # Individual patient folders (numeric IDs)
    │   └── S0001/
    │       └── IMG-XXXX-XXXXX.dcm
    ├── LUNG_DATASET/              # Patients with paired CT + PET scans
    │   └── <patient_id>/
    │       ├── CT_1_25/
    │       │   └── S0001/
    │       │       └── IMG-XXXX-XXXXX.dcm
    │       └── PET/
    │           └── S0001/
    │               └── IMG-XXXX-XXXXX.dcm
    └── instead of venous and 5/  # Additional patient cohort (32 patients)
        └── <patient_id>/
            └── S0001/
                └── IMG-XXXX-XXXXX.dcm
  • Patient folders: 320 top-level + 100 in `LUNG_DATASET` + 32 in `instead of venous and 5` = 452 total
  • 328 of the 457 patients in the sheet have a linked folder ID; 129 have no folder ID assigned
  • DICOM slices per scan: 0 – 1,995 (average ~204)
  • All images are in standard DICOM format (.dcm)
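
The three folder layouts above can be traversed with a single helper. This is a sketch under stated assumptions: it uses only the Python standard library, treats every directory directly under `img/` other than the two named cohort folders as a top-level patient folder, and counts files by the `.dcm` extension without parsing them. The names `index_patients` and `count_slices`, and the cohort labels, are hypothetical.

```python
from pathlib import Path

# Cohort folder names taken from the directory tree above.
PAIRED_DIR = "LUNG_DATASET"
EXTRA_DIR = "instead of venous and 5"


def count_slices(folder: Path) -> int:
    """Count .dcm files anywhere below `folder`."""
    return sum(1 for f in folder.rglob("*.dcm") if f.is_file())


def index_patients(img_root: Path) -> list[dict]:
    """Return one record per patient folder with its cohort and slice count."""
    records = []
    for entry in sorted(img_root.iterdir()):
        if not entry.is_dir():
            continue
        if entry.name in (PAIRED_DIR, EXTRA_DIR):
            # Cohort folders hold one subfolder per patient.
            cohort = "paired" if entry.name == PAIRED_DIR else "extra"
            patient_dirs = sorted(p for p in entry.iterdir() if p.is_dir())
        else:
            # Anything else directly under img/ is assumed to be a patient folder.
            cohort = "top_level"
            patient_dirs = [entry]
        for patient in patient_dirs:
            records.append({
                "patient_id": patient.name,
                "cohort": cohort,
                "path": patient,
                "n_slices": count_slices(patient),
            })
    return records
```

Calling `index_patients(Path("lungs_dataset/img"))` on a complete copy of the dataset should yield one record per patient folder, which can be checked against the folder counts listed above.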

Cancer Types

| Type | Count |
| --- | --- |
| Adenocarcinoma (all subtypes) | 345 |
| Squamous cell carcinoma | 40 |
| Non-keratinizing squamous cell carcinoma | 29 |
| NSCLC-NOS | 16 |
| Non-NSCLC / unclear | 10 |
| Large cell carcinoma | 1 |
| Adenosquamous carcinoma | 1 |

Biomarkers

| Biomarker | Count |
| --- | --- |
| TTF1 | 308 |
| CK7 | 242 |
| NAPSIN | 113 |
| NAPSIN A | 89 |
| P63 | 83 |
| CK5 | 48 |
| P40 | 44 |
| CK AE1/AE3 | 30 |
| KI67 | 21 |
| Not given | 18 |

Notes

  • All patient identifiers have been anonymized.
  • The LUNG_DATASET subfolder contains paired PET/CT scans for a subset of patients, useful for multimodal analysis.
  • The `instead of venous and 5` subfolder contains 32 additional patients with CT scans only.
  • Some session folders are empty (zero slices), which is why the minimum slice count per scan is 0. Check for these before loading.
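
Empty session folders can be listed up front so they are skipped during loading. The sketch below assumes that session folders are named `S` followed by digits (as in `S0001` in the directory tree) and that a folder is empty when it contains no `.dcm` files directly inside it. The function name `find_empty_series` is hypothetical.

```python
import re
from pathlib import Path

# Assumption: session folders are named like S0001, S0002, ...
SERIES_NAME = re.compile(r"^S\d+$")


def find_empty_series(img_root: Path) -> list[Path]:
    """Return session folders under `img_root` that hold no .dcm files."""
    empty = []
    for folder in img_root.rglob("*"):
        if folder.is_dir() and SERIES_NAME.match(folder.name):
            if not any(f.is_file() for f in folder.glob("*.dcm")):
                empty.append(folder)
    return sorted(empty)
```

The returned paths can be logged or excluded from a file list before any DICOM reader is invoked.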