Give each reference a single job
Unstable runs often happen when one image is expected to lock face identity, outfit, composition, and mood at the same time.
Decide whether each reference is for identity, spatial layout, or material and atmosphere. One clear goal per image gives the model an unambiguous signal for that aspect.
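One way to keep this discipline is to write the role down next to each file before generating anything. The sketch below is only an illustration: the `Role` and `Reference` names and the file names are hypothetical and not tied to any particular tool. Because each `Reference` holds exactly one `Role`, a single image cannot be assigned several jobs.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    # The three jobs named in the text; labels are illustrative.
    IDENTITY = "identity"
    LAYOUT = "spatial layout"
    MATERIAL_ATMOSPHERE = "material/atmosphere"


@dataclass(frozen=True)
class Reference:
    path: str   # hypothetical file name
    role: Role  # exactly one job per image


def plan_summary(refs):
    """Return one human-readable line per reference: 'file: role'."""
    return [f"{ref.path}: {ref.role.value}" for ref in refs]


refs = [
    Reference("face.png", Role.IDENTITY),
    Reference("room.png", Role.LAYOUT),
]
for line in plan_summary(refs):
    print(line)
```

Reading the summary aloud before a run is a quick check that no image is silently expected to do more than its stated job.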
Reduce conflicting signals
More references do not always improve stability. If the references disagree on lighting or camera distance, the model receives mixed direction.
Use one or two images with similar shooting logic (comparable lighting and camera distance) whenever possible. Consistent signals matter more than the number of references.
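A reference set can be checked for these conflicts before it is used. The following is a minimal sketch that assumes each image has been tagged by hand with a lighting label and a camera-distance label; the `ShotInfo` type, the tag values, and the `max_refs` default of 2 are assumptions made for illustration, not part of any model's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShotInfo:
    path: str             # hypothetical file name
    lighting: str         # hand-assigned tag, e.g. "soft daylight"
    camera_distance: str  # hand-assigned tag, e.g. "close-up"


def find_conflicts(shots, max_refs=2):
    """List reasons this reference set may send mixed direction."""
    issues = []
    if len(shots) > max_refs:
        issues.append(f"{len(shots)} references; prefer {max_refs} or fewer")
    for attr in ("lighting", "camera_distance"):
        # Collect the distinct tag values used across all references.
        values = sorted({getattr(shot, attr) for shot in shots})
        if len(values) > 1:
            issues.append(f"{attr} conflict: {', '.join(values)}")
    return issues


shots = [
    ShotInfo("a.png", "soft daylight", "close-up"),
    ShotInfo("b.png", "hard flash", "close-up"),
]
print(find_conflicts(shots))
```

An empty list means the tagged attributes agree; it does not guarantee that the images match in other respects.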
State inheritance rules explicitly
References are input context, not instructions. Prompt text should explicitly say what to keep and what to change.
When visual and textual constraints agree, first-pass quality improves and later extension to video becomes easier.
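The keep/change rule can be made routine by assembling the prompt from explicit lists. This is a rough sketch: the `build_prompt` helper and the exact wording it emits are hypothetical, and the phrasing that works best will depend on the model in use.

```python
def build_prompt(subject, keep, change):
    """Compose prompt text that states inheritance rules explicitly."""
    parts = [subject.strip()]
    if keep:
        # What the output should inherit from the reference image.
        parts.append("Keep from the reference: " + ", ".join(keep) + ".")
    if change:
        # What the output should deliberately do differently.
        parts.append("Change: " + ", ".join(change) + ".")
    return " ".join(parts)


prompt = build_prompt(
    "A woman reading in a cafe.",
    keep=["face identity", "outfit"],
    change=["background to a rainy street"],
)
print(prompt)
```

Keeping the two lists separate makes it easy to see at a glance whether the text contradicts what the reference image shows.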
