Personality tests measure your opinion of yourself.
Under no pressure.
The best self-report predictor in the psychological literature, conscientiousness on the Big Five, explains about 4% of job performance variance (a validity coefficient of roughly .20, squared). The most widely used personality instrument in the world, the MBTI, gives as many as half its test-takers a different four-letter type within five weeks. Not because they changed. Because the test was measuring something that doesn't exist at a meaningful level of precision.
This isn't a fringe critique. Nisbett and Wilson established in 1977 — in a paper cited over 2,600 times — that people have little or no direct introspective access to the cognitive processes that drive their decisions. When asked why they chose something, they confabulate. They produce plausible narratives. They don't know.
Self-report has a fundamental ceiling. Not a technical one — a theoretical one.
What you choose under constraint
is more informative than what you say.
Paul Samuelson established revealed preference theory in 1938: consumer preferences can be inferred entirely from observable choices under real constraints, eliminating the need for subjective utility reports. The core principle is simple. If you choose X when Y was affordable, you've revealed something real. No survey required.
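The core principle can be sketched in a few lines of Python. This is a minimal illustration only, not part of Origin Protocol: the `Observation` type, the function names, and the example prices and bundles are all hypothetical. Each observation pairs the prices a person faced with the bundle they chose, and a choice reveals a preference only when the alternative was affordable at those prices. The consistency check is the weak axiom of revealed preference.

```python
from typing import Sequence, Tuple

# Hypothetical representation: (prices faced, bundle chosen at those prices).
Observation = Tuple[Sequence[float], Sequence[float]]


def cost(prices: Sequence[float], bundle: Sequence[float]) -> float:
    """Total cost of a bundle at the given prices."""
    return sum(p * q for p, q in zip(prices, bundle))


def directly_revealed_preferred(obs_a: Observation, obs_b: Observation) -> bool:
    """True if obs_a's bundle was chosen while obs_b's bundle was affordable."""
    prices_a, bundle_a = obs_a
    _, bundle_b = obs_b
    if list(bundle_a) == list(bundle_b):
        return False  # identical bundles reveal nothing
    return cost(prices_a, bundle_b) <= cost(prices_a, bundle_a)


def violates_warp(observations: Sequence[Observation]) -> bool:
    """True if any pair of choices reveals A over B and also B over A."""
    for i, a in enumerate(observations):
        for b in observations[i + 1:]:
            if directly_revealed_preferred(a, b) and directly_revealed_preferred(b, a):
                return True
    return False


# Illustrative data (assumed): two goods, two shopping occasions.
a = ((1.0, 2.0), (1.0, 4.0))  # spent 9; the other bundle would have cost 6
b = ((2.0, 1.0), (4.0, 1.0))  # spent 9; the other bundle would have cost 6
print(directly_revealed_preferred(a, b))  # True
print(violates_warp([a, b]))              # True: each choice contradicts the other
```

No stated preference appears anywhere in the check. Everything is inferred from what was chosen and what else could have been afforded.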
Behavioral economics has since accumulated extensive evidence for the gap between stated and revealed preferences. Under emotional states, time pressure, or resource scarcity, people make decisions they would never have predicted from calm self-reflection. Ariely and Loewenstein (2006) demonstrated this directly: people in an emotionally aroused state endorsed choices they had rejected when calm. Thaler and Sunstein showed that small changes in choice architecture can substantially shift behavior even when stated preferences remain constant.
Origin Protocol applies this logic to behavioral measurement. The simulation creates genuine constraints — scarce resources, irreversible decisions, moral conflict — that make impression management difficult and authentic behavioral expression more likely. We're not asking what you'd do. We're watching what you do.
Single observations are noisy.
Aggregated patterns are stable.
Walter Mischel's 1968 challenge to personality psychology — that trait-behavior correlations rarely exceeded r = .30 — was resolved not by defending single observations, but by understanding what aggregation does. William Fleeson's experience-sampling studies showed that while individual behavioral observations correlate with stable traits at r ≈ .25, the distribution parameters of aggregated behavior show stability of r = .91 to .97.
The noise cancels. The pattern remains. A platform measuring behavior across 21 scenarios is fundamentally different from any single behavioral observation or a 10-item self-report questionnaire.
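One classical way to see why aggregation helps is the Spearman-Brown prophecy formula. It is not cited in the studies above and is used here only as an illustration, under two loud assumptions: that the single-observation figure of about .25 can be treated as the reliability of one observation, and that scenarios behave like parallel measurements. The function names are hypothetical.

```python
def spearman_brown(single_r: float, k: int) -> float:
    """Predicted reliability of the average of k parallel observations."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * single_r / (1 + (k - 1) * single_r)


def observations_needed(single_r: float, target_r: float) -> float:
    """Number of parallel observations needed to reach target_r (formula inverted)."""
    return target_r * (1 - single_r) / (single_r * (1 - target_r))


# Assumed single-observation reliability of .25, as an idealization.
for k in (1, 5, 21):
    print(k, round(spearman_brown(0.25, k), 3))
```

Under these assumptions, reliability rises from .25 for one observation to about .88 for 21. The formula is an idealization and is not how the .91 to .97 figures were computed, but it shows the same direction of effect: averaging many noisy observations yields a far more stable estimate than any single one.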
The four phases — arrival, stakes, rupture, legacy — are designed using Trait Activation Theory (Tett & Burnett, 2003). Pressure must be trait-relevant, not merely stressful. An overly constraining situation forces everyone to respond the same way, producing no variance. Origin Protocol scenarios occupy the diagnostic middle: strong enough to suppress deliberate self-presentation, structured specifically to activate the behavioral differences being measured.
Under constraint,
impression management fails.
Kahneman's dual-process framework provides the mechanism. Under cognitive load, time pressure, or resource scarcity, System 2 (slow, deliberate, impression-managed) requires working memory it no longer has. System 1 takes over: fast, automatic, shaped by deeply ingrained learning and habit. These patterns reflect stable dispositional tendencies more closely than the socially calibrated responses produced when pressure is absent.
Joshua Greene's moral cognition experiments demonstrated this directly: cognitive load selectively slowed utilitarian moral judgments while leaving deontological (rule-based, intuitive) responses unaffected. Under pressure, people default to their actual moral architecture, not their preferred self-presentation.
Mullainathan and Shafir's scarcity research adds another dimension: scarcity creates tunneling — involuntary focus on pressing constraints — that makes strategic impression management harder. Shah, Shafir, and Mullainathan (2015) found scarcity actually made people less susceptible to framing biases, because trade-off thinking became more consistent and less manipulable.
One important note: scenarios must reach moderate pressure, not maximum stress. The Yerkes-Dodson law describes an inverted-U relationship between arousal and performance: too little or too much arousal degrades decisions. Origin Protocol scenarios are calibrated for diagnostic pressure, enough to deactivate impression management but not enough to produce panic-driven responses that tell us nothing.
The measurement landscape,
honestly presented.
We're not asking you to take our word for it. Here is the validity evidence across instrument categories, drawn from published meta-analyses.
| Instrument Type | Criterion Validity | Test-Retest | Faking Resistance | Notes |
|---|---|---|---|---|
| MBTI | ~.20 (Big Five proxy) | 50% reclassification at 5 wks | Low | Fails basic psychometric standards. Missing neuroticism entirely. |
| Big Five self-report | ρ = .20–.23 (best single trait) | r ≈ .82 (2 months) | d = .50–.70 applicant inflation | Defensible instrument. Still self-report. Susceptible to faking and confabulation. |
| Observer ratings | 29–340% higher than self-report | High | High — no self-report to game | Gold standard validity. Requires trained observer. Not scalable. |
| Situational Judgment Tests | ρ = .26–.34 | r = .698 (k=37 meta-analysis) | Better than self-report | Outperforms full Big Five battery. Incremental validity over cognitive ability + Big Five. |
| Origin Protocol | SJT category + aggregation | Reassessment at 90-day minimum | High (no answer key) | 21-scenario aggregation. Aggregated behavioral distributions show stability of r = .91–.97 in Fleeson's experience-sampling studies. |
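
The validity coefficients in the table can be put on the same footing as the 4% figure for conscientiousness by squaring them. The short Python sketch below does that arithmetic for the ranges quoted in the table; the function name and dictionary labels are hypothetical, and squaring a corrected validity coefficient is a rough guide to variance explained, not a precise estimate.

```python
def variance_explained(r: float) -> float:
    """Share of criterion variance accounted for by a correlation of r."""
    return r ** 2


# Coefficients copied from the table above; labels are illustrative.
coefficients = {
    "Big Five self-report (low)": 0.20,
    "Big Five self-report (high)": 0.23,
    "Situational Judgment Tests (low)": 0.26,
    "Situational Judgment Tests (high)": 0.34,
}

for label, r in coefficients.items():
    print(f"{label}: r = {r:.2f} -> {variance_explained(r):.1%} of variance")
```

On this reading, the best single self-report trait accounts for roughly 4% to 5% of job performance variance, and situational judgment tests for roughly 7% to 12%.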
Why reassessment is available
at 90 days, not 30.
The check-in interval is not arbitrary. Four artifacts contaminate a reassessment taken too soon: item memory, practice effects, mood-state carryover, and exposure bias. The evidence converges most strongly on a 90-day threshold.
What changes over 90 days is not the underlying trait but its behavioral expression and your decision-making repertoire: how you respond now, given what has happened since. The reassessment captures adaptation, not transformation. Fundamental personality trait change requires years to decades; rank-order stability is approximately r = .98 annually (Conley, 1984). Origin Protocol frames check-in results accordingly.
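The arithmetic behind "years to decades" can be sketched by compounding the annual figure. This is a simplified illustration that assumes the .98 annual stability compounds geometrically and ignores measurement error; the function name and the chosen intervals are hypothetical.

```python
def projected_stability(annual_r: float, years: float) -> float:
    """Rank-order stability after `years`, assuming constant annual compounding."""
    if not 0 < annual_r <= 1:
        raise ValueError("annual_r must be in (0, 1]")
    if years < 0:
        raise ValueError("years must be non-negative")
    return annual_r ** years


# Illustrative intervals (assumed), using the annual figure quoted above.
intervals = (("90 days", 90 / 365), ("1 year", 1), ("10 years", 10), ("40 years", 40))
for label, years in intervals:
    print(f"{label}: r = {projected_stability(0.98, years):.3f}")
```

Under this assumption, trait rank order is almost unchanged over 90 days (above .99) and falls only to about .82 after a decade, which is why a 90-day check-in is read as a change in behavioral expression and not in the trait itself.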
Honest about the limits.
The evidence is strong. It is not complete. We want to be clear about the boundaries of what we know.
Ego depletion is not a mechanism we rely on. Once the flagship theory supporting pressure-based assessment, it has largely failed to replicate across 36 laboratories (Vohs et al., 2021: d = 0.06, non-significant). We don't cite it, and we don't need it.
Stress responses are trainable. Meichenbaum's stress inoculation research confirms this. What pressure-based behavioral assessment reveals is your current adaptive repertoire — not an immutable disposition. Someone who has developed genuine emotional resilience through experience will respond differently than they did before that development. That is not a flaw. That is the instrument working correctly.
Game-based assessment validation remains early-stage. A 2025 meta-analysis found support for convergent validity but noted substantial heterogeneity and the absence of standardized psychometric frameworks. Origin Protocol is in this emerging category. We're building the evidence base, not claiming it's complete.
Revealed preference has a ceiling. Arslan et al. (2020) found that survey-based stated risk preferences outperformed laboratory behavioral tasks in predicting real-world risk-taking — because respondents spontaneously recalled relevant past behavior when completing self-reports. The advantage of behavioral assessment is greatest when self-report cannot draw on relevant behavioral memories. Novel scenarios create exactly this condition.