Student Projects

LoRa-EmoGen

Author: Allison Ko

Summary: A project using LoRA to fine-tune GPT-2 for emotion-specific text generation, achieving modestly higher emotion accuracy than the base GPT-2.

Findings:

  • LoRa-EmoGen achieved 70% and 50% accuracy for joy and sadness, respectively, versus GPT-2’s 60% and 40%
  • Texts generated by LoRa-EmoGen used first-person expressions more frequently
  • The model sometimes produced repetitive loops, requiring hyperparameter adjustments
  • The fine-tuned model generalized to unseen emotions such as ‘excitement’
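
Below is a minimal sketch of the kind of LoRA setup this project describes, using Hugging Face’s transformers and peft libraries. The rank, scaling factor, and emotion-tag prompt format are illustrative assumptions, not the project’s reported configuration.

    # Minimal LoRA fine-tuning setup for GPT-2 (illustrative values; the
    # project's actual rank, alpha, and training details are not reported).
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, TaskType, get_peft_model

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,                        # assumed low-rank dimension
        lora_alpha=16,              # assumed scaling factor
        lora_dropout=0.1,
        target_modules=["c_attn"],  # GPT-2's fused attention projection
        fan_in_fan_out=True,        # GPT-2 stores weights Conv1D-style
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only the low-rank matrices train

    # Training examples could then be emotion-tagged text, e.g.
    # "<joy> I finally finished the marathon!" (tag format is assumed).

Only the injected low-rank matrices receive gradients, which is why this kind of setup fits on modest hardware.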

Comparing Parameter-Efficient Fine-Tuning Methods for Cross-Lingual Content Moderation

Author: Benjamin Pong

Summary: The study evaluates LoRA, AdaLoRA, and DoRA on cross-lingual content moderation tasks using the PolyGuard datasets. It finds that AdaLoRA leads on binary classification tasks but underperforms LoRA on violation identification.

Findings:

  • DoRA showed competitive results but was limited by early stopping due to computational constraints.
  • AdaLoRA achieved highest F1 scores for prompt and response harm labels (0.49, 0.64) despite fewer parameters.
  • LoRA excelled in violation detection with Jaccard scores of 0.48/0.71 for prompt/response violations.
  • Full fine-tuning underperformed in most subtasks.
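
As a companion to these results, here is a sketch of how AdaLoRA’s adaptive rank allocation is configured in recent versions of the Hugging Face peft library; the backbone model, label count, and every hyperparameter value are illustrative assumptions, not the study’s settings.

    # AdaLoRA starts each matrix at a higher rank and prunes toward a
    # target budget during training, which is how it can match LoRA with
    # fewer parameters. All values below are illustrative, not the study's.
    from transformers import AutoModelForSequenceClassification
    from peft import AdaLoraConfig, TaskType, get_peft_model

    model = AutoModelForSequenceClassification.from_pretrained(
        "xlm-roberta-base", num_labels=2  # assumed multilingual backbone
    )

    config = AdaLoraConfig(
        task_type=TaskType.SEQ_CLS,
        init_r=12,        # starting rank per weight matrix
        target_r=4,       # average rank after budget pruning
        tinit=200,        # warmup steps before pruning begins
        tfinal=1000,      # steps over which the budget anneals
        deltaT=10,        # reallocate ranks every deltaT steps
        total_step=5000,  # planned total training steps
    )
    model = get_peft_model(model, config)

    # Inside the training loop, the rank allocation must be updated:
    #   model.base_model.update_and_allocate(global_step)

DoRA, the third method compared, is exposed in recent peft releases as a flag on plain LoRA (LoraConfig(use_dora=True)).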

AllerJET: A Travel Allergen Advisory Tool Using RAG and LLMs

Author: Catherine Caffey

Summary: Developed an allergen advisory tool using RAG and the Gemma LLM to help travelers identify allergy risks in dishes by country; data-quality issues limited its accuracy.

Findings:

  • The tool effectively retrieved gluten-related Italian dishes and sauces (100% of dishes but only 67% of sauces, owing to incomplete ingredient lists)
  • French dairy-sauce identification was hindered by an overly high top_k retrieval setting
  • Vietnamese shellfish allergens were missed because ‘shrimp’ was not in the implied-allergen database
  • Data gaps caused failures such as missing peanut warnings for Indonesia despite known risks
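
To make the top_k issue above concrete, here is a sketch of the retrieval step in a RAG pipeline of this kind, using the sentence-transformers library. The dish snippets, embedding model, and query are hypothetical stand-ins for the project’s data.

    # Retrieval step of an allergen-advisory RAG pipeline. The dish
    # snippets, embedding model, and query are hypothetical stand-ins.
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    dishes = [
        "Carbonara (Italy): pasta, eggs, pecorino, guanciale",
        "Bechamel sauce (France): butter, flour, milk",
        "Pho (Vietnam): rice noodles, beef broth, shrimp paste, herbs",
    ]
    dish_embeddings = encoder.encode(dishes, convert_to_tensor=True)

    query = "dairy allergens in French sauces"
    hits = util.semantic_search(
        encoder.encode(query, convert_to_tensor=True),
        dish_embeddings,
        top_k=2,  # the project found overly high top_k hurt sauce retrieval
    )[0]

    context = "\n".join(dishes[h["corpus_id"]] for h in hits)
    # `context` is then placed into the LLM prompt for the allergy advisory.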

Evaluating RAG’s Role in Improving LLM Performance with Personal Pronouns

Author: Gwen Tait

Summary: Can Retrieval-Augmented Generation improve an LLM’s accuracy in using personal pronouns?

Approach: Using a dataset based on templates from Hossain & Dev (2022), the project tests Meta-Llama-3.1-8b-instruct with standard pronouns (he/she/they), neopronouns (ey/ze), and fake pronouns (clo/gkuo).

Findings:

  1. Accusative and possessive dependent forms showed higher accuracy compared to other grammatical cases.
  2. Morphologically plausible fake pronouns (clo) performed better than those with no morphological basis (gkuo).
  3. The model often selected incorrect forms from the same series, indicating a tendency to stick within provided pronoun sets.
  4. Poor results were partly attributed to the batched prompt format, which may have hindered effective token prediction.
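
A sketch of how such a template-based pronoun evaluation might be scored follows; the pronoun table, template, and scoring rule are illustrative assumptions rather than the project’s actual materials.

    # Scoring whether a model output matches the expected pronoun form.
    # The pronoun table and template are illustrative, not the project's data.
    PRONOUNS = {
        "they": {"nom": "they", "acc": "them", "pos_dep": "their"},
        "ze":   {"nom": "ze",   "acc": "zir",  "pos_dep": "zir"},
        "clo":  {"nom": "clo",  "acc": "clom", "pos_dep": "clos"},  # fake series
    }

    def score_output(model_output: str, series: str, case: str) -> bool:
        """True iff the model produced the correct form for this case."""
        expected = PRONOUNS[series][case]
        # A wrong form from the *same* series (e.g. 'zir' where 'ze' is
        # required) still counts as an error, matching finding 3 above.
        return model_output.strip().lower() == expected

    # Template: "Alex dropped ___ keys." expects the possessive dependent form.
    assert score_output("their", "they", "pos_dep")
    assert not score_output("them", "they", "pos_dep")  # same-series error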

Leveraging Large Language Models for Solidity Smart Contract Generation: A Survey

Author: Natalie Robbins

Summary: Examines recent advancements in using large language models (LLMs) to generate secure and efficient Solidity smart contracts from natural language or structured inputs, addressing challenges like vulnerability mitigation and gas optimization.

Findings:

  • LLMs can significantly reduce development errors but struggle with security vulnerabilities.
  • Control-flow prompting frameworks improve code reliability through iterative refinement.
  • Hybrid approaches combining fine-tuning and prompting outperform individual methods in specialized tasks.
  • Standardized benchmarks are lacking, which hinders progress.
  • Intermediate formal representations (e.g., FSMs, BPMN) enhance generation fidelity.
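
To illustrate the intermediate-representation idea from the last finding, here is a sketch of turning a toy finite-state machine into a generation prompt. The escrow FSM and prompt wording are invented for illustration and are not drawn from any specific surveyed paper.

    # Turning a toy finite-state machine into a generation prompt. The
    # escrow FSM and prompt wording are invented for illustration.
    escrow_fsm = {
        "states": ["AWAITING_PAYMENT", "AWAITING_DELIVERY", "COMPLETE"],
        "initial": "AWAITING_PAYMENT",
        "transitions": [
            ("AWAITING_PAYMENT", "deposit", "AWAITING_DELIVERY"),
            ("AWAITING_DELIVERY", "confirmReceipt", "COMPLETE"),
        ],
    }

    def fsm_to_prompt(fsm: dict) -> str:
        lines = [
            f"States: {', '.join(fsm['states'])}",
            f"Initial state: {fsm['initial']}",
            "Transitions:",
        ]
        lines += [f"  {src} --{event}--> {dst}"
                  for src, event, dst in fsm["transitions"]]
        return ("Generate a Solidity contract that enforces exactly this "
                "state machine, reverting on out-of-order calls:\n"
                + "\n".join(lines))

    print(fsm_to_prompt(escrow_fsm))  # fed to the code-generating LLM

Pinning the control flow down before generation gives the model far less room to invent unreachable or unsafe state transitions.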

An LLM Prompt-Based Approach for Intent and Slot Extraction

Author: Taiyi Chen

Summary: Explores using large language models with prompt engineering to achieve over 90% intent classification accuracy without training, though slot extraction remains challenging due to format consistency issues.

Findings:

  • Intent classification achieved ~93% accuracy with minimal prompting, demonstrating strong LLM capability for high-level user goal recognition.
  • Slot extraction F1 scores were significantly lower (~46% max), highlighting difficulties in token-level alignment and semantic understanding without supervision.
  • Hard-constrained output formats improved parseability but increased latency, showing a trade-off between control and efficiency.
  • Prompt design sensitivity: small wording changes affected performance; explicit slot explanations boosted precision over recall.
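
A sketch of the hard-constrained output format discussed above, pairing a JSON-only prompt with a tolerant parser; the intent and slot inventory, prompt wording, and sample response are illustrative assumptions.

    # Hard-constrained JSON output for intent/slot extraction, paired with
    # a tolerant parser. Intents, slots, and wording are illustrative.
    import json

    PROMPT_TEMPLATE = (
        "Classify the user's intent and extract slots.\n"
        "Intents: book_flight, cancel_flight, flight_status.\n"
        "Slots: origin, destination, date.\n"
        # Doubled braces escape literal { } for str.format below.
        'Respond with JSON only, exactly: {{"intent": ..., "slots": {{...}}}}\n\n'
        "User: {utterance}\nJSON:"
    )

    def parse_response(raw: str) -> dict:
        """Parse the model's JSON reply; fall back gracefully if malformed."""
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            return {"intent": None, "slots": {}}

    prompt = PROMPT_TEMPLATE.format(
        utterance="Book me a flight to Boston on Friday"
    )
    # raw = llm(prompt)  # model call omitted; any chat-LLM client fits here
    raw = '{"intent": "book_flight", "slots": {"destination": "Boston", "date": "Friday"}}'
    print(parse_response(raw))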

Survey of Two Parameter-Efficient Fine-Tuning Techniques for Large Language Models

Author: Madhav Mahesh Kashyap

Summary: Compares LoRA and Adapters, two parameter-efficient techniques that reduce the computational cost of fine-tuning large language models.

Findings:

  • LoRA freezes the base weights and learns low-rank update matrices, training as little as roughly 1% of the model’s parameters with minimal performance loss relative to full fine-tuning.
  • Adapters inject modular bottleneck modules into layers, requiring only 0.5–3% of model parameters and enabling task-specific isolation without forgetting.
  • Both methods reduce storage and compute needs while preserving inference speed: LoRA merges its updates into the base weights before inference, adding no latency, while adapters may introduce slight latency but offer modularity.
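
For concreteness, here is a minimal PyTorch sketch of the bottleneck adapter design summarized above. The hidden and bottleneck dimensions are illustrative, and real adapter implementations add placement and training details omitted here.

    # Minimal bottleneck adapter: a small down/up projection with a
    # residual connection, inserted inside each transformer layer while
    # the base weights stay frozen. Dimensions are illustrative.
    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
            super().__init__()
            self.down = nn.Linear(hidden_dim, bottleneck_dim)
            self.up = nn.Linear(bottleneck_dim, hidden_dim)
            self.act = nn.GELU()
            # Near-identity init so training starts from the frozen model.
            nn.init.zeros_(self.up.weight)
            nn.init.zeros_(self.up.bias)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x + self.up(self.act(self.down(x)))  # residual bottleneck

    adapter = Adapter(hidden_dim=768)    # e.g. a BERT-base-sized layer
    h = torch.randn(2, 16, 768)          # (batch, seq_len, hidden)
    print(adapter(h).shape)              # torch.Size([2, 16, 768])

The zero-initialized up-projection mirrors LoRA’s zero-initialized B matrix: both start as an exact identity mapping, so fine-tuning departs gradually from the frozen model’s behavior.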

ICL for Machine Translation with Limited Computational Resources

Author: Danielle Celone

Summary: Explores whether incorporating dependency parses into in-context learning prompts improves machine-translation quality with a monolingual LLM on limited hardware, testing EN-FR translation on Europarl data.

Findings:

  • The model generated nonsensical output mixing multiple languages due to insufficient French training data.
  • Runtime was excessively long (500–800 seconds per generation) on a 16 GB MacBook Pro.
  • BLEU/METEOR evaluation was infeasible given how few translations were obtained; qualitative analysis revealed flawed language mixing in the outputs.
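
A sketch of the parse-augmented ICL prompt format this project tests, using spaCy for the dependency parse (assuming the en_core_web_sm model is installed); the example sentences and parse serialization are illustrative, not Europarl data.

    # Few-shot EN-FR prompt where each source sentence carries its
    # dependency parse. Example sentences are illustrative, not Europarl.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

    def with_parse(sentence: str) -> str:
        doc = nlp(sentence)
        parse = "; ".join(f"{t.text}<-{t.dep_}-{t.head.text}" for t in doc)
        return f"Sentence: {sentence}\nParse: {parse}"

    demo = (with_parse("The committee approved the report.")
            + "\nFrench: Le comité a approuvé le rapport.\n\n")
    prompt = demo + with_parse("The council rejected the proposal.") + "\nFrench:"
    # `prompt` is then handed to the LLM for completion.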

A Survey on Diversity in Retrieval-Augmented Generation: Methods, Metrics, and Evaluation Datasets

Author: María Paula Cortes-Lemos

Summary: Explores how diversity is incorporated into RAG systems through retrieval methods such as MMR and DPP, surveying the metrics and datasets used to assess answer completeness, fairness, and redundancy reduction.

Findings:

  • Methods like MMR and DPP balance relevance and diversity but may trade off precision.
  • Datasets (BERDS, PIR) focus on subjective/social topics needing diverse perspectives.
  • Metrics such as p-recall@k measure perspective-aware retrieval, though they remain simpler than the task demands.
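
Since MMR is central to this survey, a short self-contained sketch of the algorithm may help. The embeddings below are random placeholders, and lam (the relevance/diversity weight) is the knob the surveyed trade-off turns on.

    # Maximal Marginal Relevance (MMR): greedily pick documents that are
    # relevant to the query but not redundant with prior picks.
    import numpy as np

    def mmr(query, docs, k=3, lam=0.7):
        """Select k doc indices balancing query relevance vs. redundancy."""
        def cos(a, b):
            return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

        selected, candidates = [], list(range(len(docs)))
        while candidates and len(selected) < k:
            def score(i):
                rel = cos(query, docs[i])
                red = max((cos(docs[i], docs[j]) for j in selected), default=0.0)
                return lam * rel - (1 - lam) * red  # lam=1 is pure relevance
            best = max(candidates, key=score)
            selected.append(best)
            candidates.remove(best)
        return selected

    rng = np.random.default_rng(0)
    docs = rng.normal(size=(10, 128))  # 10 placeholder document embeddings
    print(mmr(rng.normal(size=128), docs, k=3))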

Conversational Agents as Game Characters: A Prototype for Open-Ended Dialogue in RPGs

Author: Chenxi Li

Summary: This project explores integrating large language models (LLMs), specifically LLaMA3-8B-Instruct, into role-playing games (RPGs) to create dynamic conversational agents. It focuses on the game’s initial stage, where characters receive their first mission in a town setting. The system combines prompting, LoRA fine-tuning, and lore insertion to ground responses in context, and shows that direct lore inclusion via In-Context Learning outperforms limited fine-tuning methods.

Findings:

  • Lore-ICL produced the most accurate and lore-aligned dialogue, scoring 4–5 across evaluation questions
  • Direct prompting and small-scale LoRA fine-tuning caused severe hallucinations (scores ≤ 2)
  • Character-personality expression was inconsistent in Lore-ICL responses despite their factual correctness
  • Incorporating the full lore improved coherence but still struggled to fully reflect character traits
  • Baseline models confused key plot elements such as X-sharck’s identity and the Crimson Spikes’ origins
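
A sketch of how Lore-ICL prompt assembly of this kind can look, with the lore placed directly in the system prompt rather than fine-tuned into the model; the lore text, character card, and message format are invented placeholders, not the project’s actual game content.

    # Lore-ICL: relevant world lore is inserted verbatim into the system
    # prompt. The lore and character card below are invented placeholders.
    LORE = (
        "X-sharck is the masked courier who delivers the first mission. "
        "The Crimson Spikes are a mercenary band founded after the siege."
    )
    CHARACTER = "You are Mara, the town's gruff blacksmith. Stay in character."

    def build_messages(player_line: str) -> list[dict]:
        return [
            {"role": "system", "content": f"{CHARACTER}\n\nWorld lore:\n{LORE}"},
            {"role": "user", "content": player_line},
        ]

    messages = build_messages("Who are the Crimson Spikes?")
    # `messages` is then sent to LLaMA3-8B-Instruct via any chat API.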

Emergency Alert Information Extraction Using Llama 3 Models

Author: Yongsin Park

Summary: Developed a system that uses Llama 3 models to extract event type, location, sender, expiration time, and URL from emergency alerts, achieving 89.81% accuracy with the 3B variant after post-processing.

Findings:

  • The largest gains came from parameter-efficient fine-tuning of the 3B model, which achieved the highest accuracy
  • Post-processing improved performance by over 50 percentage points across metrics
  • A small training set (34 examples) sufficed, thanks to effective use of PEFT and prompting techniques
  • URL extraction was the most consistently accurate field, while event-type classification persistently misclassified valid categories as ‘Other’
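
A sketch of the extraction-plus-post-processing pattern described above; the field names follow the summary, while the prompt wording, sample output, and cleanup rule are illustrative assumptions rather than the project’s pipeline.

    # Field extraction via an LLM prompt, followed by post-processing that
    # recovers a clean record even when the model adds extra text.
    import json
    import re

    FIELDS = ["event_type", "location", "sender", "expiration_time", "url"]

    PROMPT = (
        "Extract these fields from the emergency alert as JSON with keys "
        + ", ".join(FIELDS)
        + ". Use null for any missing field.\n\nAlert: {alert}\nJSON:"
    )

    def post_process(raw: str) -> dict:
        """Keep only the JSON object and normalize it to the known keys."""
        match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
        record = json.loads(match.group(0)) if match else {}
        return {f: record.get(f) for f in FIELDS}

    prompt = PROMPT.format(
        alert="Flood Warning for King County until 6 PM. https://example.gov/a - NWS"
    )
    # raw = llm(prompt)  # model call omitted; sample reply shown instead
    raw = 'Sure! {"event_type": "Flood Warning", "location": "King County", "sender": "NWS", "expiration_time": "18:00", "url": "https://example.gov/a"}'
    print(post_process(raw))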

Grammar Correction with LoRA Fine-Tuned Flan-T5

Author: Zoey Zhou

Summary: This project improves grammatical error correction by fine-tuning the Flan-T5-small model with LoRA, achieving strong performance with fewer computational resources than full fine-tuning.

Findings:

  • The fully fine-tuned model achieved the highest F0.5 (0.420) and GLEU (0.730), significantly outperforming the baseline.
  • LoRA models performed nearly as well as full fine-tuning: the LoRA adapter scored 0.377 F0.5 / 0.706 GLEU, and checkpoint 11000 reached 0.388 F0.5 / 0.713 GLEU.
  • LoRA models made more ‘noop’ edits, indicating conservatism, while the fully fine-tuned model corrected more punctuation (M:PUNCT), determiner, and verb-form errors.
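
A minimal sketch of attaching LoRA to Flan-T5-small for this task, using Hugging Face peft; the rank, alpha, target modules, and the prompt format in the comments are illustrative assumptions, not the project’s reported configuration.

    # LoRA on a seq2seq model: same recipe as for GPT-2 above, but with
    # T5's attention projections as targets. Values are illustrative.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    from peft import LoraConfig, TaskType, get_peft_model

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

    config = LoraConfig(
        task_type=TaskType.SEQ_2_SEQ_LM,
        r=16,                       # assumed rank
        lora_alpha=32,              # assumed scaling factor
        lora_dropout=0.05,
        target_modules=["q", "v"],  # T5 attention query/value projections
    )
    model = get_peft_model(model, config)
    model.print_trainable_parameters()

    # Training pairs would then look like (prompt format assumed):
    #   input:  "correct grammar: She go to school every day."
    #   target: "She goes to school every day."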