Datasets & Tools
The TLI Lab emphasizes reproducible research artifacts: datasets, scripts, and evaluation assets that support reliable comparisons of traditional ML models, transformers, and LLMs.
CyberTweetGrader&Labeler (CTGL)
CTGL is a domain-specific NLP pipeline for incident-centric social-media triage. It supports relevance scoring and prioritization of posts related to cybersecurity incidents, with a focus on healthcare and critical-infrastructure contexts.
Planned Dataset Release
UHS 2020 Ransomware Incident (Twitter/X)
Curated posts about the 2020 Universal Health Services (UHS) ransomware incident, accompanied by collection protocols, ethics notes, and documentation of the dataset's utility for incident response.
Status: planned public release after manuscript acceptance / final packaging.
Benchmarking Assets
Reusable evaluation scripts and reporting templates for comparing traditional ML baselines, fine-tuned encoders, and prompted LLM inference on the same fixed data partitions, with reporting of efficiency and deployment constraints.
Status: maintained internally; publishable artifact package in progress.
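To illustrate what comparison under fixed partitions can look like, here is a minimal sketch in plain Python. It is not the lab's evaluation code: the function names (`assign_split`, `macro_f1`, `compare_systems`), the hash-based split rule, the 20% test fraction, and the choice of macro-F1 are all assumptions made for illustration.

```python
import hashlib


def assign_split(post_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a post ID to 'train' or 'test' (hypothetical rule).

    Hashing the ID keeps the partition stable across runs and machines,
    so every system is scored on exactly the same held-out posts.
    """
    digest = hashlib.sha256(post_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def compare_systems(gold, predictions_by_system):
    """Score several systems against the same gold labels from one test split."""
    return {
        name: round(macro_f1(gold, preds), 4)
        for name, preds in predictions_by_system.items()
    }
```

Each system (an ML baseline, a fine-tuned encoder, a prompted LLM) would supply its predictions for the same test partition, and `compare_systems` returns one score per system for a reporting table.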
What Will Be Included (Artifact Checklist)
- Dataset schema and documentation (fields, provenance, de-identification notes where applicable)
- Collection protocol and filtering/cleaning scripts
- Annotation guidelines (where used) and quality checks
- Reproducible evaluation: splits, metrics, and reporting tables
- Efficiency reporting hooks (latency/cost/energy where feasible)
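As a rough illustration of the last checklist item, a latency hook can be as simple as a wrapper that times each prediction call. This is a sketch under assumptions, not the packaged artifact: `measure_latency` and its report fields are hypothetical names, and cost or energy reporting would need separate instrumentation.

```python
import statistics
import time


def measure_latency(predict_fn, inputs, warmup=1):
    """Time predict_fn on each input and summarize latency in milliseconds.

    predict_fn is any callable that takes one input and returns a prediction
    (hypothetical interface). Warm-up calls are run first and not timed.
    """
    if not inputs:
        raise ValueError("inputs must contain at least one item")

    # Untimed warm-up to reduce one-off effects such as lazy initialization.
    for item in inputs[:warmup]:
        predict_fn(item)

    outputs = []
    timings_ms = []
    for item in inputs:
        start = time.perf_counter()
        outputs.append(predict_fn(item))
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    report = {
        "n": len(timings_ms),
        "mean_ms": statistics.fmean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
    }
    return outputs, report
```

The same wrapper can be applied to each compared system so that latency figures come from identical inputs and an identical timing procedure.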