Datasets & Tools

The TLI Lab emphasizes reproducible research artifacts: datasets, scripts, and evaluation assets that support reliable comparisons of traditional ML models, transformers, and LLMs.

CyberTweetGrader&Labeler (CTGL)

CTGL is a domain-specific NLP pipeline for incident-centric social-media triage. It supports relevance scoring and prioritization of posts related to cybersecurity incidents, with a focus on healthcare and critical-infrastructure contexts.

CTGL Project Page

Planned Dataset Release

UHS 2020 Ransomware Incident (Twitter/X)

Curated posts around the 2020 Universal Health Services (UHS) ransomware incident, with collection protocols, ethics notes, and incident-response utility documentation.

Status: planned public release after manuscript acceptance / final packaging.

Benchmarking Assets

Reusable evaluation scripts and reporting templates for comparing traditional ML baselines, fine-tuned encoders, and prompted LLM inference on fixed data partitions, taking efficiency and deployment constraints into account.

Status: maintained internally; publishable artifact package in progress.
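
One common way to keep partitions fixed across model families is to derive each post's split from a hash of its ID, so the assignment does not depend on row order or random seeds. The sketch below is illustrative only: the function name `assign_split`, the 80/10/10 ratios, and the string ID format are assumptions, not the lab's actual tooling.

```python
import hashlib


def assign_split(post_id: str, train: float = 0.8, dev: float = 0.1) -> str:
    """Deterministically map a post ID to 'train', 'dev', or 'test'.

    Hypothetical helper: ratios and naming are illustrative assumptions.
    """
    digest = hashlib.sha256(post_id.encode("utf-8")).hexdigest()
    # Interpret the first 8 hex characters as a value in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    if bucket < train:
        return "train"
    if bucket < train + dev:
        return "dev"
    return "test"
```

Because the split depends only on the ID, a traditional baseline, a fine-tuned encoder, and a prompted LLM would all see the same test posts, even if the data is reloaded or reordered.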

What Will Be Included (Artifact Checklist)

Dataset schema and documentation (fields, provenance, de-identification notes where applicable)
Collection protocol and filtering/cleaning scripts
Annotation guidelines (where used) and quality checks
Reproducible evaluation: splits, metrics, and reporting tables
Efficiency reporting hooks (latency/cost/energy where feasible)
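
As a rough illustration of how the evaluation and efficiency items might fit together, the sketch below scores a binary relevance classifier and records per-item wall-clock latency in one pass. Everything here is an assumption made for the example: the `evaluate` function, the 0/1 label encoding, and the choice of precision, recall, F1, and median latency are not taken from the lab's scripts. Cost and energy reporting are omitted.

```python
import statistics
import time
from typing import Callable, Sequence


def evaluate(
    predict: Callable[[str], int],
    texts: Sequence[str],
    gold: Sequence[int],
) -> dict:
    """Score binary relevance predictions and time each call.

    Hypothetical helper: labels are assumed to be 1 (relevant) or 0.
    """
    preds, latencies = [], []
    for text in texts:
        start = time.perf_counter()
        preds.append(predict(text))
        latencies.append(time.perf_counter() - start)

    tp = sum(p == 1 and g == 1 for p, g in zip(preds, gold))
    fp = sum(p == 1 and g == 0 for p, g in zip(preds, gold))
    fn = sum(p == 0 and g == 1 for p, g in zip(preds, gold))

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "median_latency_s": statistics.median(latencies) if latencies else 0.0,
    }
```

Any model that can be wrapped as a function from text to a 0/1 label can be passed as `predict`, which keeps the reporting table identical across model families.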