The Science — Origin Protocol

01 The Problem With Self-Report

Personality tests measure your opinion of yourself.
Under no pressure.

The best self-report instrument in the psychological literature — conscientiousness on the Big Five — explains 4% of job performance variance. The most widely used personality instrument in the world, the MBTI, gives half its test-takers a different four-letter type within five weeks. Not because they changed. Because the test was measuring something that doesn't exist at a meaningful level of precision.

This isn't a fringe critique. Nisbett and Wilson established in 1977 — in a paper cited over 2,600 times — that people have little or no direct introspective access to the cognitive processes that drive their decisions. When asked why they chose something, they confabulate. They produce plausible narratives. They don't know.

Self-report has a fundamental ceiling. Not a technical one — a theoretical one.

4%

Job performance variance explained by the best Big Five predictor (conscientiousness)

Barrick & Mount, 1991; Hurtz & Donovan, 2000; Salgado, 1997

50%

MBTI test-takers who receive a different type within five weeks

Pittenger, 1993, 2005

72%

Of behavioral variance unexplained even by explicit stated intentions — the most direct self-report predictor

Sheeran, 2002 — meta-analysis, N = 82,107
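The percentages above come from squaring a correlation: a predictor that correlates r with an outcome accounts for r² of its variance. The sketch below only reproduces that arithmetic. The function names are illustrative and do not come from any of the cited studies.

```python
def variance_explained(r: float) -> float:
    """Share of outcome variance accounted for by a predictor with correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie in [-1, 1]")
    return r * r


def variance_unexplained(r: float) -> float:
    """Share of outcome variance left over once the predictor is accounted for."""
    return 1.0 - variance_explained(r)


# Conscientiousness: corrected validity of roughly .20 to .23
# (the range quoted in the comparison table in section 05).
for rho in (0.20, 0.23):
    print(f"rho = {rho:.2f}: {variance_explained(rho):.1%} of variance explained")

# Sheeran (2002): 72% unexplained means intentions explain the remaining 28%.
print(f"explained by stated intentions: {1 - 0.72:.0%}")
```

A validity of .20 corresponds to the 4% figure. The upper end of the range, .23, corresponds to a little over 5%.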

02 Revealed Preference

What you choose under constraint
is more informative than what you say.

Paul Samuelson established revealed preference theory in 1938: consumer preferences can be inferred entirely from observable choices under real constraints, eliminating the need for subjective utility reports. The core principle is simple. If you choose X when Y was affordable, you've revealed something real. No survey required.
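As a minimal sketch of that principle, assuming a toy two-good economy: a bundle is directly revealed preferred to another when it was chosen while the other was affordable at the same prices. The `Observation` type, the function names, and the prices and quantities are hypothetical illustrations. This is the textbook consumer-choice check, not Origin Protocol's scoring method.

```python
from typing import NamedTuple, Sequence


class Observation(NamedTuple):
    prices: Sequence[float]  # price of each good when the choice was made
    bundle: Sequence[float]  # quantities actually chosen


def cost(prices: Sequence[float], bundle: Sequence[float]) -> float:
    """Total spend for a bundle at the given prices."""
    return sum(p * q for p, q in zip(prices, bundle))


def directly_revealed_preferred(a: Observation, b: Observation) -> bool:
    """True if a's bundle was chosen while b's bundle was affordable at a's prices."""
    different = tuple(a.bundle) != tuple(b.bundle)
    return different and cost(a.prices, b.bundle) <= cost(a.prices, a.bundle)


def violates_warp(observations: Sequence[Observation]) -> bool:
    """Weak Axiom check: no two bundles may each be revealed preferred to the other."""
    return any(
        directly_revealed_preferred(a, b) and directly_revealed_preferred(b, a)
        for i, a in enumerate(observations)
        for b in observations[i + 1:]
    )


# Hypothetical data: each tuple is (good 1, good 2).
chose_x = Observation(prices=(1, 2), bundle=(2, 2))  # spent 6; (3, 0) would have cost 3
chose_y = Observation(prices=(2, 1), bundle=(3, 0))  # spent 6; (2, 2) would have cost 6

print(directly_revealed_preferred(chose_x, chose_y))
print(violates_warp([chose_x, chose_y]))
```

In this made-up data each bundle was chosen while the other was affordable, so the pair is inconsistent with a single stable preference ordering. The inference needs only choices and constraints, not a stated preference.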

Behavioral economics has since accumulated extensive evidence for the gap between stated and revealed preferences. Under emotional states, time pressure, or resource scarcity, people make decisions they would never have predicted from calm self-reflection. Ariely demonstrated this directly. Thaler and Sunstein showed that small changes in choice architecture dramatically shift behavior even when stated preferences remain constant.

Origin Protocol applies this logic to behavioral measurement. The simulation creates genuine constraints — scarce resources, irreversible decisions, moral conflict — that make impression management difficult and authentic behavioral expression more likely. We're not asking what you'd do. We're watching what you do.

"Observer-rated personality validities were 29% to 340% higher than self-report validities — and observer ratings showed incremental validity over self-reports. The reverse was not true."

Oh, Wang, & Mount (2011) — meta-analysis, N = 44,178

03 Why 21 Scenarios

Single observations are noisy.
Aggregated patterns are stable.

Walter Mischel's 1968 challenge to personality psychology — that trait-behavior correlations rarely exceeded r = .30 — was resolved not by defending single observations, but by understanding what aggregation does. William Fleeson's experience-sampling studies showed that while individual behavioral observations correlate with stable traits at r ≈ .25, the distribution parameters of aggregated behavior show stability of r = .91 to .97.

The noise cancels. The pattern remains. A platform measuring behavior across 21 scenarios is fundamentally different from any single behavioral observation or a 10-item self-report questionnaire.
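One standard way to see why aggregation helps is the Spearman-Brown prophecy formula from classical test theory, which gives the reliability of a composite of k parallel observations. The sketch below is illustrative only. The mean inter-observation correlation of .25 is an assumed value, not a figure from Fleeson's studies, and the formula is not the method those studies used, so the output should not be read as reproducing the .91 to .97 stability range.

```python
def spearman_brown(mean_r: float, k: int) -> float:
    """Reliability of a composite of k parallel observations.

    mean_r is the average correlation between any two single observations.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * mean_r / (1.0 + (k - 1) * mean_r)


ASSUMED_MEAN_R = 0.25  # assumption for illustration, not a measured value

for k in (1, 5, 10, 21):
    print(f"{k:>2} observations: composite reliability {spearman_brown(ASSUMED_MEAN_R, k):.3f}")
```

Under that assumption a single observation has reliability .25, while a 21-observation composite reaches .875. The gain per added scenario shrinks as k grows, which is why a modest number of well-chosen scenarios captures most of the benefit.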

The four phases — arrival, stakes, rupture, legacy — are designed using Trait Activation Theory (Tett & Burnett, 2003). Pressure must be trait-relevant, not merely stressful. An overly constraining situation forces everyone to respond the same way, producing no variance. Origin Protocol scenarios occupy the diagnostic middle: strong enough to suppress deliberate self-presentation, structured specifically to activate the behavioral differences being measured.

◈

Single behavioral observation: trait-behavior correlation r ≈ .20–.30 (Mischel's personality coefficient)

Mischel (1968); Funder & Ozer (1983)

◈

Aggregated behavioral distribution: stability r = .91–.97; trait prediction r = .42–.56

Fleeson (2001); Fleeson & Gallagher (2009) — meta-analysis, 20,000+ behavioral reports

◈

Situational judgment tests (the measurement category Origin Protocol belongs to): corrected validity ρ = .26–.34 for predicting real-world outcomes, outperforming the entire Big Five battery

McDaniel et al. (2001, 2007); Christian et al. (2010)

04 Pressure as a Diagnostic Tool

Under constraint,
impression management fails.

Kahneman's dual-process framework provides the mechanism. Under cognitive load, time pressure, or resource scarcity, System 2 (slow, deliberate, impression-managed) requires working memory it no longer has. System 1 takes over: fast, automatic, shaped by deeply learned associations and habitual patterns. These patterns are more reflective of stable dispositional tendencies than the socially calibrated responses produced when pressure is absent.

Joshua Greene's moral cognition experiments demonstrated this directly: cognitive load selectively slowed utilitarian moral judgments while leaving deontological (rule-based, intuitive) responses unaffected. Under pressure, people default to their actual moral architecture, not their preferred self-presentation.

Mullainathan and Shafir's scarcity research adds another dimension: scarcity creates tunneling — involuntary focus on pressing constraints — that makes strategic impression management harder. Shah, Shafir, and Mullainathan (2015) found scarcity actually made people less susceptible to framing biases, because trade-off thinking became more consistent and less manipulable.

One important note: scenarios must reach moderate pressure, not maximum stress. The Yerkes-Dodson law describes an inverted-U relationship between arousal and performance: too little arousal and too much both degrade decision quality. Origin Protocol scenarios are calibrated for diagnostic pressure: enough to deactivate impression management, not enough to produce panic-driven responses that tell us nothing.

05 How We Compare

The measurement landscape,
honestly presented.

We're not asking you to take our word for it. Here is the validity evidence across instrument categories, drawn from published meta-analyses.

Instrument Type	Criterion Validity	Test-Retest	Faking Resistance	Notes
MBTI	~.20 (Big Five proxy)	50% reclassification at 5 wks	Low	Fails basic psychometric standards. Missing neuroticism entirely.
Big Five self-report	ρ = .20–.23 (best single trait)	r ≈ .82 (2 months)	d = .50–.70 applicant inflation	Defensible instrument. Still self-report. Susceptible to faking and confabulation.
Observer ratings	29–340% higher than self-report	High	High — no self-report to game	Gold standard validity. Requires trained observer. Not scalable.
Situational Judgment Tests	ρ = .26–.34	r = .698 (k=37 meta-analysis)	Better than self-report	Outperforms full Big Five battery. Incremental validity over cognitive ability + Big Five.
Origin Protocol	SJT category + aggregation	90-day minimum retest interval	High (no answer key)	21-scenario aggregation. Aggregated behavioral distributions show stability r = .91–.97 in Fleeson (2001).

06 The 90-Day Interval

Why reassessment is available
at 90 days, not 30.

The check-in interval is not arbitrary. Four artifacts contaminate any reassessment conducted too soon after the first: item memory, practice effects, mood-state carryover, and exposure bias. The 90-day threshold is where the evidence converges most strongly.

Days

Item memory remains moderately strong. Mood-state carryover from initial testing is high. Practice effects not yet substantially dissipated. Responses contaminated by artifact.

✕ Too short

Days

Memory effects substantially reduced (Chmielewski & Watson, 2009). CogState data show negligible practice effects for computerized tasks. Mood correlation from shared circumstances persists.

◌ Absolute floor

Days — Recommended

Autobiographical memory research: substantial forgetting of specific content. Practice effects reduced ~60% (Scharfen et al., 2018). Seasonal/circumstantial mood correlation low. Life events sufficient for genuine behavioral adaptation.

✓ Evidence convergence

What does change at 90 days is behavioral expression and decision-making repertoire — how you respond now, given what has happened since. The reassessment captures adaptation, not transformation. Fundamental personality trait change requires years to decades; rank-order stability is approximately r = .98 annually (Conley, 1984). Origin Protocol frames check-in results accordingly.
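To put the annual stability figure in perspective, the sketch below compounds it over different intervals. It assumes the r = .98 annual figure compounds multiplicatively and ignores measurement unreliability. Both are simplifying assumptions for illustration, and the function name is hypothetical.

```python
ANNUAL_STABILITY = 0.98  # Conley (1984), as quoted above


def expected_stability(years: float, annual: float = ANNUAL_STABILITY) -> float:
    """Expected rank-order correlation after `years`, assuming constant annual decay."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return annual ** years


intervals = (("90 days", 90 / 365), ("1 year", 1), ("10 years", 10), ("30 years", 30))
for label, years in intervals:
    print(f"{label:>8}: r = {expected_stability(years):.3f}")
```

Under these assumptions trait rank order is essentially unchanged at 90 days (r of about .995) and drifts only over years to decades, which is why a 90-day check-in is framed as adaptation and not as trait change.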

07 What We Don't Claim

Honest about the limits.

The evidence is strong. It is not complete. We want to be clear about the boundaries of what we know.

Ego depletion is not a mechanism we rely on. Once the flagship theory supporting pressure-based assessment, it has largely failed to replicate across 36 laboratories (Vohs et al., 2021: d = 0.06, non-significant). We don't cite it, and we don't need it.

Stress responses are trainable. Meichenbaum's stress inoculation research confirms this. What pressure-based behavioral assessment reveals is your current adaptive repertoire — not an immutable disposition. Someone who has developed genuine emotional resilience through experience will respond differently than they did before that development. That is not a flaw. That is the instrument working correctly.

Game-based assessment validation remains early-stage. A 2025 meta-analysis found support for convergent validity but noted substantial heterogeneity and absence of standardized psychometric frameworks. Origin Protocol is in this emerging category. We're building the evidence base, not claiming it's complete.

Revealed preference has a ceiling. Arslan et al. (2020) found that survey-based stated risk preferences outperformed laboratory behavioral tasks in predicting real-world risk-taking — because respondents spontaneously recalled relevant past behavior when completing self-reports. The advantage of behavioral assessment is greatest when self-report cannot draw on relevant behavioral memories. Novel scenarios create exactly this condition.

Key References

Barrick, M.R., & Mount, M.K. (1991). The big five personality dimensions and job performance. Personnel Psychology, 44(1), 1–26.

Christian, M.S., Edwards, B.D., & Bradley, J.C. (2010). Situational judgment tests. Personnel Psychology, 63(3), 461–533.

Chmielewski, M., & Watson, D. (2009). What is being assessed and why it matters. Psychological Assessment, 21(2), 107–118.

Conley, J.J. (1984). The hierarchy of consistency. Personality and Individual Differences, 5(1), 11–25.

Connelly, B.S., & Ones, D.S. (2010). An other perspective on personality. Psychological Bulletin, 136(6), 1092–1122.

Fleeson, W. (2001). Toward a structure- and process-integrated view of personality. Journal of Personality and Social Psychology, 80(6), 1011–1027.

Greene, J.D., Morelli, S.A., Lowenberg, K., Nystrom, L.E., & Cohen, J.D. (2008). Cognitive load selectively interferes with utilitarian moral judgment. Cognition, 107(3), 1144–1154.

Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.

Mani, A., Mullainathan, S., Shafir, E., & Zhao, J. (2013). Poverty impedes cognitive function. Science, 341(6149), 976–980.

McDaniel, M.A., Hartman, N.S., Whetzel, D.L., & Grubb, W.L. (2007). Situational judgment tests, response instructions, and validity. Personnel Psychology, 60(1), 63–91.

Mischel, W. (1968). Personality and Assessment. Wiley.

Nisbett, R.E., & Wilson, T.D. (1977). Telling more than we can know. Psychological Review, 84(3), 231–259.

Oh, I.S., Wang, G., & Mount, M.K. (2011). Validity of observer ratings of the five-factor model. Journal of Applied Psychology, 96(4), 762–773.

Pittenger, D.J. (2005). Cautionary comments regarding the Myers-Briggs Type Indicator. Consulting Psychology Journal, 57(3), 210–221.

Roberts, B.W., & DelVecchio, W.F. (2000). The rank-order consistency of personality traits. Psychological Bulletin, 126(1), 3–25.

Samuelson, P.A. (1938). A note on the pure theory of consumer's behaviour. Economica, 5(17), 61–71.

Scharfen, J., Peters, J.M., & Holling, H. (2018). Retest effects in cognitive ability tests. Intelligence, 67, 44–66.

Sheeran, P. (2002). Intention-behavior relations. European Review of Social Psychology, 12(1), 1–36.

Tett, R.P., & Burnett, D.D. (2003). A personality trait-based interactionist model of job performance. Journal of Applied Psychology, 88(3), 500–517.

Vohs, K.D., et al. (2021). A multisite preregistered paradigmatic test of the ego-depletion effect. Psychological Science, 32(10), 1566–1581.

The science of behavior under pressure.
