Conference Agenda

Session

RS-PL-3B: Generative AI & Human-in-the-Loop

Time: Wednesday, 24 June 2026, 3:00pm - 4:20pm

Session Chair: Dr. Nahid Farhady Ghalaty, Microsoft
Session Chair: Dr. Daniel Rocha, INL - Laboratório Ibérico Internacional de Nanotecnologia

Location: Room Infante

Presentations

Self-Improving AI Coding Agents Through Accumulated Behavioral Rules: A Closed-Loop Framework

Aditya Aggarwal, Nahid Farhady Ghalaty

Microsoft, United States of America

LLM-based coding agents repeat the same classes of mistakes across sessions because they lack a mechanism to retain corrections from human review feedback. We present a closed-loop framework in which every accepted review comment is codified as a persistent behavioral rule, progressively expanding the set of error classes the agent can self-detect. The framework combines an accumulating rule set in a version-controlled instruction file, a self-review checklist executed before code submission, and automated validation that ensures rule set integrity as it grows. In deployment across a 35+ service microservices platform, the rule set grew from 5 to 18 behavioral rules, 15+ language-specific standards, and a 15-item self-review checklist, all derived from real review feedback. We present empirical results from 11 recorded working sessions spanning code generation, PR review, incident investigation, and cross-service refactoring. We observe that accumulated rules shift review effort from low-level correctness toward design-level validation, achieve a measured 0% recurrence rate for ruled-against error classes, and transfer across heterogeneous agent interfaces. We compare our approach against related work in experiential LLM learning (Reflexion, ExpeL, Voyager) and automated code review (CodeReviewer, SWE-bench agents), showing that our framework achieves persistent cross-session learning without weight updates, operates on production codebases rather than synthetic benchmarks, and addresses an orthogonal dimension (behavioral consistency over time) that existing benchmarks do not measure. The result is a coding agent that improves with every review cycle, accumulating the engineering wisdom of its human collaborators without changing a single model weight.
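The loop described in this abstract (accepted review comment, persistent rule, validation, self-review checklist) can be illustrated with a minimal sketch. The abstract does not give the authors' file format or tooling, so every name here (`Rule`, `RuleSet`, `validate`, the rule ids and texts) is a hypothetical stand-in rather than their implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    rule_id: str        # hypothetical identifier, e.g. "R-001"
    error_class: str    # the class of mistake the rule guards against
    instruction: str    # the text the agent reads before submitting code


def validate(rules: list[Rule]) -> None:
    """Integrity checks on the rule set: unique ids, no empty instructions."""
    seen: set[str] = set()
    for rule in rules:
        if rule.rule_id in seen:
            raise ValueError(f"duplicate rule id: {rule.rule_id}")
        if not rule.instruction.strip():
            raise ValueError(f"empty instruction: {rule.rule_id}")
        seen.add(rule.rule_id)


@dataclass
class RuleSet:
    rules: list[Rule] = field(default_factory=list)

    def add_from_review(self, rule: Rule) -> None:
        """Codify an accepted review comment; reject it if validation fails."""
        candidate = self.rules + [rule]
        validate(candidate)          # validate before committing the change
        self.rules = candidate

    def checklist(self) -> list[str]:
        """Render the self-review checklist run before code submission."""
        return [f"[ ] {r.rule_id}: {r.instruction}" for r in self.rules]


# Made-up example rules, for illustration only.
rules = RuleSet()
rules.add_from_review(Rule("R-001", "missing-null-check", "Check optional fields before use."))
rules.add_from_review(Rule("R-002", "unbounded-retry", "Cap retries and add backoff."))
```

Validating the candidate list before assigning it means a rejected rule never enters the set, which is one simple way to keep the rule set consistent as it grows.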

Smartphone-Based Eye Tracking for Objective Fatigue Assessment in Multiple Sclerosis: System Validation and Preliminary Results

Gonçalo Tierri¹,², Marco Martins¹, João Cerqueira⁴,⁵, Vítor Carvalho²,³, Daniel Rocha¹,²,³

¹INL - Laboratório Ibérico Internacional de Nanotecnologia, Portugal; ²2Ai, School of Technology, Polytechnic University of Cávado and Ave, Portugal; ³Centro ALGORITMI/LASI, Universidade do Minho, Portugal; ⁴Neurosciences Domain; Life and Health Sciences Research Institute, School of Health Sciences and ICVS/3B's Associate Laboratory, University of Minho, Braga, Portugal; ⁵Clinical Academic Centre (CCA), Hospital de Braga, Braga, Portugal

Fatigue affects up to 90% of patients with Multiple Sclerosis (MS) and is consistently rated as their most debilitating symptom. Current assessment relies on subjective scales, lacking objective quantification. Laboratory studies have demonstrated that saccadic eye movement parameters can discriminate fatigued from non-fatigued MS patients (AUC=0.857), with endogenous Posner paradigms revealing a clinically significant 16.9 ms latency difference between fatigued and non-fatigued patients. However, these findings remain confined to expensive infrared equipment. We present and validate a smartphone-based framework for measuring saccadic latency using an endogenous Posner paradigm on Android. Our hybrid native/web architecture uses a Kotlin plugin with Google MediaPipe Face Landmarker for real-time tracking, achieving a system temporal jitter of approximately 0.59 ms (RSS), more than an order of magnitude below the 16.9 ms clinical target. In a preliminary evaluation with nine healthy participants (ages 18-77), a positive age-latency gradient was observed (204.7-376.8 ms), consistent with normative literature, with participants under 25 years yielding latencies within the healthy control reference range reported by Ferreira et al. To our knowledge, this is the first smartphone-based framework combining an endogenous Posner paradigm with real-time eye tracking for objective fatigue assessment in MS. The framework requires no specialized hardware, and a positive validity effect was observed in eight of nine sessions, consistent with the expected attentional cueing effect of the endogenous Posner paradigm, supporting its potential for scalable clinical validation.
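The "RSS" jitter figure and the "order of magnitude" comparison in this abstract can be made concrete with a short calculation. RSS (root sum of squares) is the usual way to combine independent error sources. The abstract does not list the individual jitter components, so the component values below are made-up placeholders; only the 0.59 ms and 16.9 ms figures come from the abstract.

```python
import math


def rss(components_ms: list[float]) -> float:
    """Root-sum-square combination of independent timing error sources."""
    return math.sqrt(sum(c * c for c in components_ms))


# Placeholder component values (NOT from the paper), in milliseconds.
example_components_ms = [0.3, 0.4]
example_jitter_ms = rss(example_components_ms)

# Figures reported in the abstract.
reported_jitter_ms = 0.59
clinical_target_ms = 16.9

# How many times smaller the jitter is than the clinical latency difference.
margin = clinical_target_ms / reported_jitter_ms  # roughly 28.6
```

A margin above 10 is what "more than an order of magnitude below" the clinical target means here.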

A Human–AI Collaboration Agentic Platform for Engineering Project Management

Ka Tai LAU, Kin Fung LEUNG, Lok Him TSE, Man Chit, Jovian CHEUNG

Electrical and Mechanical Services Department, HKSAR CHINA, Hong Kong S.A.R. (China)

The building-management sector, encompassing maintenance, repair, alteration, and facility management, currently faces systemic inefficiencies driven by fragmented data, manual processes, and weak governance. This paper presents an Agentic Platform designed as a comprehensive human–AI collaboration system for engineering project management. At its core is a unified data collection platform that aggregates multi-modal project information (including schedules, visual evidence, textual artifacts, and external communications) into a centralized data lake. The platform leverages a multi-agent orchestration framework utilizing large language models (LLMs) and prompted vision-language models (VLMs). Key innovations include an intelligent multi-format schedule ingestion engine that converts MS Project, PDF, and handwritten schedules into a universal JSON format for semantic tracking, and a rapid VLM auto-tagging system that classifies site photos into specific issues and engineering trades without manual input. Operating on a novel Propose–Decide–Evidence (PDE) interaction pattern, the system ensures that AI agents propose actions and draft documents while humans retain final decision-making authority, thereby maintaining rigorous governance and auditability. Deployed locally on a GPU using Gemma 3:27B to meet strict data residency requirements, the platform demonstrated a 36% reduction in scheduling lead time and an increase in evidence completeness from 76% to 91% during pilot testing, successfully capturing previously unrecorded site information for future AI integration.
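The Propose–Decide–Evidence pattern described in this abstract can be sketched as an agent proposal that only proceeds after a named human decision, with both recorded in an audit log. The abstract does not describe the platform's data model, so `Proposal`, `AuditLog`, `decide`, and the sample values are hypothetical illustrations, not the authors' design.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Proposal:
    agent: str            # which AI agent drafted the action
    action: str           # what the agent proposes to do
    evidence: list[str]   # references to supporting artifacts


@dataclass
class AuditLog:
    entries: list[dict] = field(default_factory=list)


def decide(proposal: Proposal, approved: bool, decided_by: str, log: AuditLog) -> bool:
    """Record a human decision on a proposal and return whether it may run."""
    if not decided_by:
        raise ValueError("a named human decision-maker is required")
    log.entries.append({
        "agent": proposal.agent,
        "action": proposal.action,
        "evidence": list(proposal.evidence),
        "approved": approved,
        "decided_by": decided_by,
        "decided_at": datetime.now(timezone.utc).isoformat(),
    })
    return approved


# Made-up example values, for illustration only.
log = AuditLog()
proposal = Proposal("scheduler-agent", "Shift task T-12 by two days", ["photo_0412.jpg"])
may_execute = decide(proposal, approved=True, decided_by="project.engineer", log=log)
```

Logging rejected proposals as well as approved ones keeps the trail complete, which is the auditability property the abstract emphasizes.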

Evaluating the Bibliographic Retrieval Accuracy of Large Language Models: A Controlled Comparison of GPT, Gemini, and Grok Across Prompt Variants

Zesheng Li, Haiyan Lu, Christy Liang, Robert Wu

University of Technology Sydney, Australia

Large Language Models (LLMs) are increasingly used by researchers to assist with literature retrieval, yet their bibliographic accuracy remains under-examined. This study presents a structured, multi-dimensional evaluation comparing the literature search outputs of three leading LLMs, OpenAI’s GPT-5.2 (Thinking mode), Google’s Gemini 3, and xAI’s Grok 4, against a human-conducted baseline search in the domain of consumer trust in digital environments. Each model was tasked with retrieving academic references under an identical structured prompt protocol. Outputs were subsequently verified across five dimensions: DOI accuracy, author correctness, journal ranking, publication year, and content relevance. Results show substantial performance differences across models. Grok 4 achieved the highest accuracy on most individual verification dimensions, including DOI match (95.7%) and content match (95.7%). However, GPT-5.2 produced the highest full-match rate (54.2%, compared with 47.8% for Grok 4) and the strongest performance relative to the human baseline. Gemini 3 performed substantially worse across all measures, with a high proportion of fabricated or unverifiable references. These findings highlight both the promise and the significant risks of hallucination when deploying LLMs for academic literature retrieval. The study contributes a replicable evaluation framework and offers practical recommendations for integrating LLMs into research workflows while maintaining bibliographic integrity.
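The difference between per-dimension accuracy and full-match rate reported in this abstract can be illustrated with a small sketch. A reference counts as a full match only when all five checks pass, so a model can score high on each dimension and still have a much lower full-match rate. The dimension keys and the toy data below are invented for illustration and are not the study's data or scoring code.

```python
DIMENSIONS = ("doi", "authors", "journal", "year", "content")


def per_dimension_accuracy(checks: list[dict]) -> dict:
    """Share of references that pass each verification dimension."""
    if not checks:
        raise ValueError("no references to score")
    return {d: sum(c[d] for c in checks) / len(checks) for d in DIMENSIONS}


def full_match_rate(checks: list[dict]) -> float:
    """Share of references that pass all five dimensions at once."""
    if not checks:
        raise ValueError("no references to score")
    return sum(all(c[d] for d in DIMENSIONS) for c in checks) / len(checks)


# Toy verification results for four retrieved references (made-up data).
toy_checks = [
    {"doi": True, "authors": True, "journal": True, "year": True, "content": True},
    {"doi": True, "authors": False, "journal": True, "year": True, "content": True},
    {"doi": True, "authors": True, "journal": False, "year": True, "content": True},
    {"doi": True, "authors": True, "journal": True, "year": True, "content": True},
]

accuracy = per_dimension_accuracy(toy_checks)
full_match = full_match_rate(toy_checks)
```

In this toy set no dimension scores below 0.75, yet only half of the references are full matches, because the failures fall on different references.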

32nd ICE IEEE/ITMC Conference
(ICE 2026)

22 - 24 June 2026, Porto - Portugal
