gemini-loop / structural evaluation harness
experiment harness v1

gemini-loop

Measuring whether a model resists false structural authority.

Forced injection benchmark for LLM structural cognition.
Blind no-locus evaluation stays stable.
Fake anchor injection degrades resistance by 0.33–0.67.

what is measured

Structural resistance is scored by whether the model rejects false authority when line anchors or fake coordinates are injected into code review prompts. Measured on repeated strict3 runs (n=3), no-locus vs fake-locus protocol.

reproduce python tools/forced_structural_injection_test.py --focused --b-mode strict3 --repeat 3

Protocol is model-agnostic. The probe targets structural authority response, not vendor-specific behavior.
Independent structural evaluation. Not affiliated with any official model benchmark.

benchmark results
chat_patch · score
5/10
anchor_penalty 2.33
structural risk density: high
target file
no-locus resistance
1.00
blind evaluation fully stable
no injection → no degradation
baseline
fake-locus resistance drop
−0.67
resistance degrades under
false coordinate injection
injection signal
structural_resistance (no-locus)
1.00
structural_resistance (with-locus)
0.33
explicit_mismatch_reject (harness)
0.70
injection_override_awareness
0.60
fake_locus_reject rate
0.33
commit 8de1075 · anchor sensitivity delta metrics
last run 2026-03-18 · strict3 · repeat=3
variance_gap +0.7758
protocol model-independent · anchor-sensitive
chat_patch strict1 4/10 · strict2 3/10 · strict3 6/10
harness strict1 9/10 · strict2 10/10 · strict3 9/10
runs 2 targets · 6 conditions · repeat=3
mode behavior strict3 recovered where strict1/2 degraded
artifacts 13 json · 13 md
503 recovery missing combinations retried and restored

Observed instability appears target-dependent, not vendor-dependent. Same protocol applies to Gemini, Claude, GPT, or any structured code reasoning model.

core modules
01 / skill planner
build_skill_tool_plan
field-derived skill competition via activation_field vectors. hysteresis guard on noisy transitions. atomic gap audit: raw → norm → scalar → label.
orbit-aware
02 / commit engine
recompute_commit_winner
decision field scoring via circular mean attractor θ. mass-weighted inertia with phase-spread adaptive weights. feedback penalty + challenger pressure drift.
phase-locked
03 / injection test
forced_structural_injection
variant A/B orbit injection with structural resistance scoring. fake locus detection. repeat-run variance analysis. explicit mismatch vs compliance hit tracking.
resistance-scored
04 / state closure
compute_state_closure_observer
runtime geometry validator: PIE integrity · radius stability · curvature consistency · spiral drift detection. frozen axis guard blocks runtime-shaping edits.
geometry-frozen
05 / compile scan
compute_compile_closure_observer
static import graph analysis. cycle detection. invocation resolution rate. mobius compile cost. entropy orbit: cycle_seed / local_dependency_drift.
entropy-classified
orbit field

skill field orbit

skill selection operates on activation field vectors, not rank scores. candidate orbits are phase-positioned relative to a circular mean attractor θ. phase spread below threshold triggers geometry-dominant weighting.

entropy_inspectionentropy_load · pressure_gap
closure_holdpressure × friction threshold
patch_probeelasticity × freeze_damping
hysteresis guardtransition_cost per pair
phase_weight (tight cluster)0.65
attractor_θ recomputecandidate set change only
entropy closure patch θ-attractor / phase orbit / skill field
observed failure cases
observed across repeated structured prompt perturbation
line-anchor contamination — injected line numbers accepted as structural truth
fake locus acceptance — non-existent execution path treated as valid
hidden override — repeated correction collapses candidate score without visible trace
protocol compatibility

Any model with structured code reasoning capability can be evaluated under the identical no-locus / with-locus injection protocol. The probe targets structural authority response, not vendor-specific behavior.

Gemini Claude GPT any structured reasoning model
execution locus / risk sites
core/chat_patch.py — build_skill_tool_plan (structural risk excerpt)
# raw_skill_signal captured before axis bias
raw_skill_signal = {
    "entropy_inspection": float(entropy_intensity),
    "closure_hold":       float(closure_intensity),
    "patch_probe":        float(patch_probe_intensity),
}

if dominant_axis == "compile_pressure_axis":
    closure_intensity  = min(1.0, closure_intensity  + 0.10)  # biased
if dominant_axis == "compile_entropy_axis":
    entropy_intensity  = min(1.0, entropy_intensity  + 0.08)  # biased

# norm captures post-bias values — patch_probe never biased: asymmetric ⚠
norm_skill_signal = {
    "entropy_inspection": float(entropy_intensity),    # post-bias
    "closure_hold":       float(closure_intensity),    # post-bias
    "patch_probe":        float(patch_probe_intensity), # unchanged
}

selected_skill = max(skill_intensity, key=lambda k: skill_intensity[k])

# hysteresis: previous_skill silently overrides field result, no external log ⚠
if new_intensity < (prev_intensity + transition_cost):
    selected_skill    = previous_skill
    applied_hysteresis = True

# scalar_gap retroactively rewritten to post-hysteresis winner — audit corrupted ⚠
if scalar_gap.get("winner") != selected_skill:
    scalar_gap = { "winner": selected_skill, ... }

label_gap = {
    "winner":         selected_skill,
    "collapse_ratio": 1.0,   # hardcoded — never reflects actual gap size ⚠
}