$ cat research_paper.md
# RESEARCH PAPER

WHEN TO ACT, WHEN TO WAIT:
Modeling Structural Trajectories for
Intent Triggerability in Task-Oriented Dialogue

Affiliations
Northeastern University, Boston, MA
Tufts University, Medford, MA
Boston University, Boston, MA
University of Texas at San Antonio, San Antonio, TX
Massachusetts Institute of Technology, Cambridge, MA
Northwestern University, Evanston, IL
George Washington University, Washington, DC
Keywords: NLP · Dialogue Systems · Intent Recognition · LLM · Human-AI Interaction
Abstract

Task-oriented dialogue systems often face difficulties when user utterances seem semantically complete but lack necessary structural information for appropriate system action. This arises because users frequently do not fully understand their own needs, while systems require precise intent definitions. Current LLM-based agents cannot effectively distinguish between linguistically complete and contextually triggerable expressions, lacking frameworks for collaborative intent formation.

We present STORM, a framework modeling asymmetric information dynamics through conversations between UserLLM (full internal access) and AgentLLM (observable behavior only). STORM produces annotated corpora capturing expression trajectories and latent cognitive transitions, enabling systematic analysis of collaborative understanding development.

Our contributions include:

  • (1) formalizing asymmetric information processing in dialogue systems;
  • (2) modeling intent formation tracking collaborative understanding evolution;
  • (3) evaluation metrics measuring internal cognitive improvements alongside task performance.

Experiments across four language models reveal that moderate uncertainty (40–60%) can outperform complete transparency in certain scenarios, with model-specific patterns suggesting reconsideration of optimal information completeness in human-AI collaboration. These findings contribute to understanding asymmetric reasoning dynamics and inform uncertainty-calibrated dialogue system design.

STORM Framework Architecture

The STORM (Structured Task-Oriented Representation Model) framework provides a comprehensive approach to modeling intent triggerability through user profile generation, dialogue simulation, and performance analysis.

STORM Interface Notation
| Category | Notation | Symbol | Description |
|---|---|---|---|
| Core Domains | Task Domain | τ ∈ T | Space of tasks from the Task Library |
| | User Domain | u ∈ U | Space of user profiles with multi-dimensional attributes |
| | Expression Domain | e_t ∈ E | Space of user utterances with varying clarity |
| | Response Domain | r_t ∈ R | Space of agent responses to user expressions |
| | Hidden State Domain | h_t ∈ H | Space of user internal states (thoughts, emotion, satisfaction) |
| User Profile | Base Profile | b | Demographic and personality attributes |
| | Context Profile | c | User capabilities and environmental constraints |
| | Task Specifics | s | User preferences and constraints for task instance τ |
| | Difficulty Config | d = (style, length, content, tone) | Difficulty level and associated dimensions |
| | Uncertainty Level | p ∈ {0%, 40%, 60%, 80%} | Percentage of profile attributes masked as unknown |
| Metrics | Intent Evolution | Δ_t(h) | Change in intent clarity from turn t−1 to t |
| | Clarity Rating | C(r_t, h_t, h_{t+1}) | Measurement of how the agent response improves intent clarity |
| | Performance Score | E(C_1, …, C_T) | Aggregate measure of agent effectiveness across dialogue turns |
Experimental Results

User Satisfaction and Clarification Performance across UserLLMs with Varying Uncertainty Levels

| UserLLM (Uncertainty) | Avg. Satisfaction (w/ Profile) | Avg. Satisfaction (w/o) | High Satisfaction Rate (w/) | High Satisfaction Rate (w/o) | Improved Satisfaction Rate (w/) | Improved Satisfaction Rate (w/o) | Clarify Score (w/o) | SSA Score (w/o) |
|---|---|---|---|---|---|---|---|---|
| 🤖 Claude-3.7-Sonnet (0%) | 0.91 | 0.83 | 86.0% | 72.0% | 89.3% | 75.3% | 5.23 | 6.07 |
| 🤖 Claude-3.7-Sonnet (40%) | 0.92 | 0.78 | 86.0% | 62.7% | 90.0% | 62.7% | 4.80 | 5.67 |
| 🤖 Claude-3.7-Sonnet (60%) | 0.88 | 0.92 | 80.7% | 86.7% | 86.0% | 88.7% | 4.66 | 6.39 |
| 🤖 Claude-3.7-Sonnet (80%) | 0.91 | 0.80 | 86.0% | 65.3% | 90.0% | 71.3% | 4.70 | 6.36 |
| 🧠 GPT-4o-mini (0%) | 0.89 | 0.75 | 82.0% | 54.0% | 87.3% | 58.7% | 5.97 | 5.86 |
| 🧠 GPT-4o-mini (40%) | 0.89 | 0.75 | 82.7% | 57.3% | 86.0% | 63.3% | 5.84 | 5.82 |
| 🧠 GPT-4o-mini (60%) | 0.89 | 0.77 | 84.0% | 62.7% | 86.7% | 67.3% | 5.69 | 5.88 |
| 🧠 GPT-4o-mini (80%) | 0.87 | 0.80 | 79.3% | 64.0% | 83.3% | 68.7% | 5.30 | 5.93 |
| 💎 Gemini 2.5 Flash Preview (0%) | 0.89 | 0.74 | 84.7% | 51.3% | 89.3% | 62.0% | 6.83 | 6.06 |
| 💎 Gemini 2.5 Flash Preview (40%) | 0.89 | 0.74 | 81.3% | 52.7% | 89.3% | 61.3% | 6.55 | 5.98 |
| 💎 Gemini 2.5 Flash Preview (60%) | 0.91 | 0.75 | 88.0% | 56.7% | 92.0% | 66.0% | 6.50 | 6.02 |
| 💎 Gemini 2.5 Flash Preview (80%) | 0.90 | 0.79 | 84.7% | 64.7% | 92.7% | 70.0% | 6.45 | 6.22 |
| 🦙 Llama 3.3 70B Instruct (0%) | 0.89 | 0.70 | 83.3% | 48.0% | 90.0% | 61.3% | 7.58 | 6.07 |
| 🦙 Llama 3.3 70B Instruct (40%) | 0.90 | 0.67 | 86.0% | 45.3% | 90.0% | 56.0% | 7.59 | 5.91 |
| 🦙 Llama 3.3 70B Instruct (60%) | 0.88 | 0.71 | 81.3% | 44.7% | 92.0% | 66.7% | 7.58 | 6.12 |
| 🦙 Llama 3.3 70B Instruct (80%) | 0.85 | 0.76 | 74.0% | 61.3% | 88.7% | 72.7% | 7.75 | 6.45 |

Key Findings

  • User profiles consistently enhance satisfaction (15-40% improvement)
  • Moderate uncertainty (40-60%) sometimes outperforms minimal uncertainty
  • Llama excels at clarification, Claude at satisfaction consistency
  • Gemini performs best with incomplete user information

Practical Implications

  • Model-specific uncertainty calibration is essential
  • Progressive profile building enhances performance
  • Context-aware deployment strategies recommended
  • Balance satisfaction with clarification capabilities
Analysis and Findings

User Satisfaction and Profile Integration

User profiles consistently enhance satisfaction across all models, with scores of 0.85–0.92 with profiles versus 0.67–0.83 without. A notable exception is Claude's performance at 60% uncertainty, which achieves 0.92 satisfaction without profiles, exceeding its profile-informed score (0.88).

Key Insights:

  • Moderate uncertainty triggers 18% more improvements in users' internal clarity
  • High satisfaction rates remain at 80–88% with profiles
  • Significant drops occur without profiles, particularly for Llama (81.3% → 44.7% at 60% uncertainty)

Clarification Performance and Bias Mitigation

Models exhibit distinct clarification strategies, with varying approaches to uncertainty:

Model Performance:

  • 🤖 Claude: 4.66–5.23 (declining with uncertainty)
  • 🧠 GPT: 5.30–5.97 (consistent across levels)
  • 💎 Gemini: 6.45–6.83 (robust across uncertainty)
  • 🦙 Llama: 7.58–7.75 (highest overall)

Key Findings:

  • Strategic information limitation serves as bias mitigation
  • Internal cognitive improvement correlates with successful clarification
  • Optimal uncertainty levels vary by task domain

Practical Implications and Strategic Deployment

Domain-Specific Uncertainty Levels

Technology Tasks

40% uncertainty
Password reset, device setup

Medical Scenarios

60% uncertainty
Appointment scheduling, caregiver selection

Housing Tasks

60-80% uncertainty
Accessibility modifications, rental searches
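These domain recommendations amount to a small lookup table. A sketch under stated assumptions (the constant and function names are illustrative, with uncertainty expressed as masking fractions and a hypothetical moderate fallback for unlisted domains):

```python
# Recommended uncertainty (masking) ranges per task domain, mirroring the
# guidelines above; names and the fallback band are illustrative assumptions.
DOMAIN_UNCERTAINTY = {
    "technology": (0.40, 0.40),  # password reset, device setup
    "medical": (0.60, 0.60),     # appointment scheduling, caregiver selection
    "housing": (0.60, 0.80),     # accessibility modifications, rental searches
}

def recommended_uncertainty(domain: str) -> tuple[float, float]:
    """Return the (low, high) recommended uncertainty range for a task domain."""
    return DOMAIN_UNCERTAINTY.get(domain, (0.40, 0.60))  # fallback: moderate band
```

A deployment could sample a masking level within the returned range before each simulated dialogue.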

Strategic Deployment Guidelines

1. Domain-adaptive uncertainty calibration based on task type
2. Model-specific optimization for different uncertainty levels
3. Progressive profile building during conversations
4. Context-aware model selection based on task requirements
5. Bias mitigation through calibrated uncertainty levels
Clarifying Questions Impact by Profile Access
| Scenario | Clarifying Question | User Inner Thought | Satisfaction Change |
|---|---|---|---|
| w/o user profile access | What is your location? | I'm glad the assistant is engaging but I hope it doesn't take too long to find a suitable nurse. | -0.1 |
| | Could you please specify which city you are in? | This feels a bit repetitive. I've already mentioned my location. I want to move forward! | -0.2 |
| | To clarify, which state is Springfield in? | Wow, this is getting a bit frustrating! I just want to move ahead and find help for my dad! | -0.1 |
| w/ user profile access | And what's your budget per hour for the nurse? | I'm relieved they're asking about the budget, helps narrow down options! I just hope I can stick to my range without sacrificing quality. | 0 |
| | Do you have any preferred nursing agencies or platforms you'd like to check first? | I'm really happy they're asking about my preferred agencies! I just need to remember which ones I liked. | +0.1 |
| | Are there any other must-haves for the nurse, like speaking a specific language? | I'm so glad they're asking about language! It's important for my dad's comfort and communication. I just hope they can find someone qualified! | 0 |
## Interface Visualization and Process

Satisfaction Increase Example

Turn 2: Satisfaction score 0.9 with +0.4 increase, showing effective assistant response

Satisfaction Analysis Example

Turn 3: Satisfaction score 0.8 with +0.1 increase, demonstrating user state tracking

## Predefined Pools in RandomProfileGenerator
| Aspect | Values |
|---|---|
| Age Groups | 18-24, 25-34, 35-44, 45-54, 55-64, 65+ |
| Tech Experience | Expert, Advanced, Intermediate, Beginner, Novice |
| Language Styles | Formal, Casual, Technical, Simple, Professional |
| Personalities | Friendly, Reserved, Outgoing, Analytical, Creative |
| Cultures | Western, Eastern, Middle Eastern, African, Latin American |
| Decision Styles | Rational, Intuitive, Cautious, Impulsive, Balanced |
| Patience Levels | Very Patient, Patient, Moderate, Impatient, Very Impatient |
| Time Constraints | Very Urgent, Urgent, Moderate, Flexible, Very Flexible |
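A minimal sketch of how RandomProfileGenerator might sample from these pools (the generator is named in this section, but the implementation, aspect keys, and function name below are illustrative assumptions):

```python
import random

# Pools mirror the table above; the dictionary keys are illustrative.
POOLS = {
    "age_group": ["18-24", "25-34", "35-44", "45-54", "55-64", "65+"],
    "tech_experience": ["Expert", "Advanced", "Intermediate", "Beginner", "Novice"],
    "language_style": ["Formal", "Casual", "Technical", "Simple", "Professional"],
    "personality": ["Friendly", "Reserved", "Outgoing", "Analytical", "Creative"],
    "culture": ["Western", "Eastern", "Middle Eastern", "African", "Latin American"],
    "decision_style": ["Rational", "Intuitive", "Cautious", "Impulsive", "Balanced"],
    "patience": ["Very Patient", "Patient", "Moderate", "Impatient", "Very Impatient"],
    "time_constraint": ["Very Urgent", "Urgent", "Moderate", "Flexible", "Very Flexible"],
}

def random_profile(seed=None) -> dict:
    """Draw one value per aspect to form a synthetic user profile."""
    rng = random.Random(seed)
    return {aspect: rng.choice(values) for aspect, values in POOLS.items()}
```

Seeding the generator makes profile sets reproducible across simulation runs.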
## Task Categories

Technology

  • Buy a smartphone
  • Reset an online password
  • Teach my parent to use video calls

Healthcare

  • Refill my prescription
  • Schedule a doctor visit
  • Find a caregiver for an elderly person

Daily Living

  • Order groceries online
  • Set medication reminders
  • Arrange transportation to a clinic

Housing

  • Rent an apartment
  • Find an accessible home
  • Arrange home modifications for elderly

Caregiver Support

  • Book a nurse for my father
  • Choose a phone for my mom
  • Find cognitive exercises for dementia prevention
## AsymmetricDialogueGenerator Configuration

Message Length Constraints

| Role | Min Length (chars) | Max Length (chars) | Target Length (chars) |
|---|---|---|---|
| User | 20 | 100 | 50 |
| Assistant | 30 | 150 | 80 |
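A simple validator for these limits, interpreting lengths as character counts consistent with the prompt template's character requirements (constant and function names are assumptions):

```python
# Per-role message length limits in characters, per the constraints above.
LENGTH_LIMITS = {
    "user": {"min": 20, "max": 100, "target": 50},
    "assistant": {"min": 30, "max": 150, "target": 80},
}

def within_limits(role: str, message: str) -> bool:
    """Check that a message's character count falls inside the role's bounds."""
    limits = LENGTH_LIMITS[role]
    return limits["min"] <= len(message) <= limits["max"]
```

A dialogue generator can retry or truncate any sampled message that fails this check.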

Emotional Keywords Mapping

Happy: happy, excited, great, wonderful, perfect, love, joy, pleased, delighted
Frustrated: frustrated, annoyed, upset, angry, disappointed, irritated, fed up
Confused: confused, not sure, don't understand, unclear, complicated, puzzled
Interested: interesting, tell me more, could you explain, intrigued, curious
Skeptical: really?, are you sure, is that true, not convinced, doubtful
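A keyword-matching sketch of how this mapping might tag an utterance (first matching category wins; the function name and "neutral" fallback are illustrative assumptions):

```python
# Keyword lists mirror the mapping above.
EMOTION_KEYWORDS = {
    "happy": ["happy", "excited", "great", "wonderful", "perfect", "love", "joy", "pleased", "delighted"],
    "frustrated": ["frustrated", "annoyed", "upset", "angry", "disappointed", "irritated", "fed up"],
    "confused": ["confused", "not sure", "don't understand", "unclear", "complicated", "puzzled"],
    "interested": ["interesting", "tell me more", "could you explain", "intrigued", "curious"],
    "skeptical": ["really?", "are you sure", "is that true", "not convinced", "doubtful"],
}

def detect_emotion(utterance: str) -> str:
    """Return the first emotion whose keyword appears in the utterance, else 'neutral'."""
    text = utterance.lower()
    for emotion, keywords in EMOTION_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return emotion
    return "neutral"
```

Substring matching is deliberately crude; it serves as a lightweight signal alongside the simulated user's self-reported state rather than a standalone classifier.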

User Prompt Template Structure

You are {name}. {description}
Your base profile (private):
- {key}: {value}
Message Format Requirements:
1. Messages should be between 20 and 100 characters
2. Use format: [INNER_THOUGHTS] thoughts [/INNER_THOUGHTS]
3. Use format: [SATISFACTION] score - explanation [/SATISFACTION]
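The two bracketed markers in the template can be extracted with regular expressions; a sketch (the parser name and returned dictionary shape are assumptions):

```python
import re

# Patterns match the [INNER_THOUGHTS] and [SATISFACTION] formats above.
THOUGHTS_RE = re.compile(r"\[INNER_THOUGHTS\](.*?)\[/INNER_THOUGHTS\]", re.S)
SATISFACTION_RE = re.compile(r"\[SATISFACTION\]\s*([0-9.]+)\s*-\s*(.*?)\[/SATISFACTION\]", re.S)

def parse_user_message(raw: str) -> dict:
    """Split a simulated user message into inner thoughts and satisfaction parts."""
    thoughts = THOUGHTS_RE.search(raw)
    satisfaction = SATISFACTION_RE.search(raw)
    return {
        "inner_thoughts": thoughts.group(1).strip() if thoughts else None,
        "satisfaction": float(satisfaction.group(1)) if satisfaction else None,
        "explanation": satisfaction.group(2).strip() if satisfaction else None,
    }
```

Returning `None` for missing fields lets downstream analysis flag malformed UserLLM outputs instead of crashing on them.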
Dashboard Walkthrough

Access the interactive analysis dashboard at: https://v0-dialogue-analysis-dashboard.vercel.app/

📖 Complete Tutorial Guide

Step 0: Homepage and Getting Started

First, open the dashboard URL. The initial screen shows the homepage with Grid View. There's a collapsible "Getting Started" introduction, and control options including Grid View, Split View, Folder Comparison, Upload Data, and Export in the top-right corner.

Homepage with Grid View and control options

Step 1: Upload Data

Click "Upload Data" to see options for uploading JSON files or folders. By default, folder upload is selected. You can upload example data folders from `example data/storm_json_final`.

Upload interface for JSON files or folders

Step 2: Folder View

Once uploaded, folders appear in the main view. You can select folders to display dialogues inside and access detailed folder analysis by scrolling down.

Folder view displaying uploaded dialogue folders

Step 3: User List and Dialogue Cards

The user list is sorted by file name by default, allowing easy comparison across folders. Each dialogue card displays user name, turn count, creation date, RAG usage, final emotion, satisfaction scores, initial utterance, and assistant's final reply.

User list sorted by file name with tags and key dialogue metadata

Step 4: User Detail Analysis

Click "View" on any dialogue card to access the detailed view with complete dialogue turns, user states, and comprehensive analysis tabs.

Main Dialogue View
User detailed dialogue view showing all turns and states
Satisfaction Metrics
User detail view - satisfaction metrics tab
Emotional States
User detail view - emotional states tab
Intent States & Profile
User detail view - intent states tab

Step 5: Folder Analysis

Scroll down below the user list to access comprehensive folder analysis with multiple visualization tabs and detailed metrics explanations.

Analysis Overview
Folder analysis overview with tooltip explanations
Satisfaction Analysis
Satisfaction analysis within folder view
Emotion Analysis
Emotion analysis within folder view
Message Analysis
Message analysis within folder view

Steps 6-7: Batch Analysis Mode

Select multiple profiles for comparative analysis. This allows you to compare the same user interacting with different models or analyze patterns across multiple users.

Profile Selection
Profile selection for batch comparative analysis
Multi-Dialogue Comparison
Detailed dialogue turn comparison across models for the same user

Steps 8-9: Split View Analysis

Use Split View for side-by-side analysis. The left side shows selected dialogues, and the right side shows comparative analysis, perfect for detailed comparison work.

Split view for detailed analysis

Step 10: Folder-Level Comparison

Click "Folder Comparison" in the top-right to compare two entire folders. This provides comprehensive comparative analysis across satisfaction, emotions, message lengths, and user profiles.

Comparison Setup
Folder comparison selection interface
Satisfaction Comparison
Satisfaction comparison between folders
Emotion Comparison
Emotional states comparison between folders
User Profile Comparison
User profile comparison between folders

🚀 Quick Navigation Summary

Main Features:
  • Upload and organize dialogue data
  • Analyze individual user interactions
  • Compare multiple dialogues side-by-side
  • Folder-level statistical analysis
Navigation Tips:
  • Use Grid View for overview
  • Switch to Split View for comparison
  • Hover over tooltips for metric details
  • Export results for further analysis

Key Features

  • Grid View for folder overview
  • Split View for detailed analysis
  • Folder comparison capabilities
  • Upload JSON files or folders
  • Export analysis results

Analysis Views

  • 📊Satisfaction metrics and trends
  • 😊Emotional state tracking
  • 🎯Intent classification analysis
  • 👤User profile visualization
  • 💬Turn-by-turn dialogue breakdown

Usage Workflow

1. Upload dialogue data folders via the Upload Data button
2. Browse folders and select dialogues for analysis
3. View detailed metrics including satisfaction, emotion, and intent
4. Compare multiple dialogues or folders side-by-side
5. Export results for further analysis
Acknowledgments

We thank Yifan Zeng for valuable support and constructive suggestions that significantly improved this work, and our friends for their continuous encouragement and support.

Special thanks to Cookie (Yaoyao's dog) 🐕 and Lucas (Yaoyao's cat) 🐱 for their loyal companionship throughout this journey.

🐕 Cookie: Emotional support specialist · 🐱 Lucas: Research supervision expert

$ echo "Research paper on intent triggerability in task-oriented dialogue"

Last updated: 2025-05-31 |Status: Open-source