How It's Built

OpenScripture is not only a Bible reader. It is a data project that connects published translations, source-language morphology, manuscript signals, commentary, cross-references, and personal study choices into one careful reading experience.

The most distinctive technical layer is a translation-specific reverse interlinear: word-level alignment between the English text a reader chose and the original Hebrew, Aramaic, or Greek beneath it — keyed to Strong's numbers so every link is stable, comparable, and lockable.

OpenScripture data path

Evidence to alignment to reader

1. Source data

Hebrew, Aramaic, Greek, Latin, English, manuscript signals, notes, canons, and licensing terms all enter the system as separate evidence streams.

2. Alignment graph

OpenScripture builds a reverse interlinear for every supported translation: reviewed links between Strong's-keyed source tokens and each translation's actual English wording.

3. Reader signals

The app uses those links for interlinear study, Word Locks, translation comparison, certainty signals, and AI translation context.

Why This Matters

Many Bible apps make digital text fast and searchable. OpenScripture is aiming at a different layer of accessibility: helping readers see why translations differ, how English words connect back to the source languages, and where manuscript evidence is more or less settled.

That means treating technical infrastructure as part of the mission. If the data is careful, the reader can be simple. A tap can open the original-language word. A small marker can explain a real translation difference. A personal Word Lock can help someone learn recurring vocabulary while reading naturally.

The Alignment Spine: Strong's Numbers, Manuscripts, and Lanes

A traditional interlinear Bible keeps the original Hebrew or Greek word order and places English beneath it — useful to a linguist, disorienting to most readers. A reverse interlinear works the other way: the English text stays exactly as the translator wrote it, and the matching original-language data appears beneath each word in natural reading order. Building this per-translation, for many different English Bibles, is the central technical challenge of OpenScripture.

The shared backbone that makes per-translation alignment possible is Strong's numbers — a stable lexical ID system introduced by James Strong in 1890 and still the most widely used cross-translation linking scheme in Bible software. A Strong's number stays constant across translations: whether an English Bible renders a word as "LORD" or "Yahweh," both point to the same Hebrew lexeme (H3068). That shared key is what makes Word Locks work: lock a Strong's number once, and the preference propagates wherever that word appears, across every translation on the same alignment spine. Strong's numbers are paired with morphological tagging — part of speech, grammatical case, verb tense, and stem — that describes how a word functions in its sentence, not just what it means.
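As a minimal sketch of why a shared key helps, the snippet below indexes a few alignment links by Strong's number and compares how two translations render the same lexeme. The `AlignmentLink` shape, the translation codes, and the sample rows are illustrative assumptions, not OpenScripture's actual schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentLink:
    """Hypothetical row shape: one English wording tied to one source lexeme."""
    translation: str  # illustrative translation code
    verse: str        # verse reference
    strongs: str      # Strong's number of the source lexeme
    morphology: str   # free-form morphology label (illustrative only)
    english: str      # the translation's own wording for this token


# Illustrative sample rows, not real pipeline output.
links = [
    AlignmentLink("KJV", "Gen 2:4", "H3068", "proper noun", "LORD"),
    AlignmentLink("WEB", "Gen 2:4", "H3068", "proper noun", "Yahweh"),
    AlignmentLink("KJV", "Gen 2:4", "H430", "noun, masculine plural", "God"),
]

# Index every link by Strong's number so renderings can be compared
# across translations that share the same source text.
by_strongs = defaultdict(list)
for link in links:
    by_strongs[link.strongs].append(link)

renderings = {l.translation: l.english for l in by_strongs["H3068"]}
print(renderings)  # {'KJV': 'LORD', 'WEB': 'Yahweh'}
```

Because both rows carry the same key, a preference attached to H3068 can be looked up from either translation without comparing English strings.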

Three Manuscript Foundations

A translation should be aligned to the text it was actually translated from. Most Protestant English Bibles follow the Hebrew Masoretic Text and the Greek New Testament — but translations descended from the Greek Septuagint or the Latin Vulgate need their own layer. Forcing them onto the Hebrew spine produces false links.

Hebrew & Aramaic OT

Westminster Leningrad Codex (Masoretic Text)

Morphology from the OpenScriptures Hebrew Bible (OSHB / morphhb), CC BY 4.0. The traditional Hebrew source for most Protestant English Bibles.

Greek NT

SBL Greek New Testament (SBLGNT)

Edited by Michael W. Holmes (SBL & Logos). Paired with MorphGNT morphology: every Koine Greek word tagged by lemma, part of speech, and parsing details such as case, tense, voice, and mood.

Septuagint & Vulgate

Separate original-language layers

The Greek Septuagint (LXX) and the Latin Clementine Vulgate are maintained as their own layers. Translations that descend from them are aligned to those texts — not forced onto the Hebrew Masoretic spine.

Three Alignment Lanes

Not all alignment data is created the same way. The system assigns each translation to the most reliable lane available:

Lane A — Authoritative bridges

Some translations carry scholar-made word-level tagging: the KJV via STEPBible's TAGNT/TAHOT data, the Berean Standard Bible via its published interlinear, and the unfoldingWord Literal Text via its hand-built alignment. These are the gold standard for their translation families.

Lane B — Publisher alignment

Some publishers ship word-alignment data alongside their translation text. The NET Bible provides its own alignment data and is treated as authoritative for that translation.

Lane C — Computer-assisted alignment

For translations without pre-built alignment data, the pipeline computes alignment mechanically, scores each row, and audits the output against hand-verified alignments using Alignment Error Rate before it reaches the reader.
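The lane choice can be pictured as a simple priority rule. This is a sketch under the assumption that the lanes are tried in A, B, C order; the function name and the boolean flags are hypothetical.

```python
from enum import Enum


class Lane(Enum):
    AUTHORITATIVE = "A"  # scholar-made word-level tagging
    PUBLISHER = "B"      # alignment data shipped by the publisher
    COMPUTED = "C"       # computer-assisted, audited before release


def select_lane(has_scholar_tagging: bool, has_publisher_alignment: bool) -> Lane:
    """Pick the most reliable lane available (assumed order: A, then B, then C)."""
    if has_scholar_tagging:
        return Lane.AUTHORITATIVE
    if has_publisher_alignment:
        return Lane.PUBLISHER
    return Lane.COMPUTED


print(select_lane(False, True).value)  # B
```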

Feature by Feature

Each feature has a visible reader experience and a less visible data problem underneath it. The work is to make the hidden layer rigorous enough that the visible layer can feel calm, quick, and trustworthy.

The big data layer

Reverse Interlinear: Translation-Specific Alignment

How it is built

A reverse interlinear keeps the English exactly as the translator wrote it and places the matching original-language data beneath each word in natural reading order, per translation, for every supported Bible. Where an authoritative alignment source exists (KJV via STEPBible, the Berean Standard Bible's published interlinear, the unfoldingWord Literal Text), that data is used directly. For other translations, the pipeline runs a computer-assisted alignment engine: it assigns each source token a set of candidate English spans, then scores them on lexical glosses, learned translation probabilities (how likely this translation is to render a given Hebrew or Greek word with a specific English word), part-of-speech and lemma matching, and positional evidence. Reviewed rows become runtime gold data; generated rows stay labelled as approximate until they pass an Alignment Error Rate audit.
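One way to picture the scoring step is a weighted combination of the four signals, with a threshold below which no link is emitted. The weights, the threshold, the field names, and the sample values are all assumptions made for illustration; how the real engine weighs its evidence is not documented here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    english_span: str
    gloss_match: float       # 1.0 if the span matches a lexical gloss
    translation_prob: float  # learned P(english | source lemma)
    pos_lemma_match: float   # agreement of part of speech and lemma
    position_score: float    # closeness to the expected position


# Hypothetical weights chosen only for this example.
WEIGHTS = {
    "gloss_match": 0.3,
    "translation_prob": 0.4,
    "pos_lemma_match": 0.2,
    "position_score": 0.1,
}


def score(c: Candidate) -> float:
    """Weighted sum of the four evidence signals."""
    return sum(w * getattr(c, name) for name, w in WEIGHTS.items())


def best_candidate(cands, threshold=0.5):
    """Return the top-scoring candidate, or None if nothing clears the threshold."""
    top = max(cands, key=score, default=None)
    return top if top is not None and score(top) >= threshold else None


# Two made-up candidate spans for the Hebrew noun behind "beginning" in Genesis 1:1.
strong = Candidate("beginning", 1.0, 0.9, 1.0, 0.8)
weak = Candidate("In", 0.0, 0.05, 0.0, 0.9)
print(best_candidate([strong, weak]).english_span)  # beginning
print(best_candidate([weak]))                       # None
```

Returning `None` instead of the least-bad candidate keeps a visible gap where the evidence is thin.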

Technical complications

Biblical Hebrew is often Verb-Subject-Object; English is Subject-Verb-Object, so the source word rarely lands directly below its English rendering. The engine models positional displacement (how far a word is likely to drift between source and target) using a diagonal prior of the kind used in statistical word aligners such as fast_align. One Hebrew word can become an English phrase and vice versa; translators also supply words English grammar requires that have no separate source token (the possessive in a Hebrew construct chain, the copula, the article). The pipeline explicitly models supplied and implied words rather than forcing false one-to-one matches. Disambiguation is harder still: 'the LORD God' in Genesis 2 maps to two Hebrew words whose glosses overlap, and the system separates them using learned co-occurrence, morphology, and local positional evidence. Transliterated names that vary across translations (Nebuchadnezzar vs Nabuchodonosor) are matched using edit distance on the orthographic form.
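Two of the ideas above are easy to sketch in isolation. The first function is a diagonal positional prior in the spirit of the fast_align parameterization cited in the Methods list; the second is a standard Levenshtein edit distance for comparing name spellings. The function names and the `lam` sharpness value are assumptions for illustration, not the pipeline's actual code.

```python
import math


def diagonal_prior(i, src_len, tgt_len, lam=4.0):
    """Probability of each target position j (1-indexed) for source position i.

    Positions near the diagonal (i / src_len close to j / tgt_len) get more
    mass; `lam` controls how sharply the prior favours the diagonal.
    """
    weights = [
        math.exp(-lam * abs(i / src_len - j / tgt_len))
        for j in range(1, tgt_len + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def name_similarity(a, b):
    """Edit distance rescaled to [0, 1]; 1.0 means identical spellings."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


prior = diagonal_prior(2, 4, 4)
print(prior.index(max(prior)) + 1)       # 2 (the position on the diagonal)
print(levenshtein("kitten", "sitting"))  # 3
```

A prior like this only nudges the aligner toward nearby positions; strong lexical evidence can still pull a link far off the diagonal, which matters for Verb-Subject-Object Hebrew.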

Desired result

The operating principle is "zero wrong anchors": a confident wrong link is worse than a visible gap. Mechanical alignments are evaluated against hand-verified rows using Alignment Error Rate (AER), which combines precision against possible links with recall against sure links. Only alignments that pass are promoted to gold. Hard cases, such as free paraphrase, idiomatic collapses, and English idioms where no source word cleanly corresponds, are flagged for careful review rather than guessed. When a reader taps a word in the English text, they see original-language data the system can stand behind.
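For reference, AER as defined by Och and Ney (2003) can be computed from three sets of links: the predicted links A, the sure gold links S, and the possible gold links P (with S contained in P). The sketch below uses made-up link pairs; the function name is hypothetical.

```python
def alignment_scores(predicted, sure, possible):
    """Return (precision, recall, AER) for sets of (source_idx, target_idx) links.

    precision = |A ∩ P| / |A|
    recall    = |A ∩ S| / |S|
    AER       = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
    """
    if not predicted or not sure:
        raise ValueError("predicted and sure link sets must be non-empty")
    possible = possible | sure  # every sure link is also a possible link
    a_s = len(predicted & sure)
    a_p = len(predicted & possible)
    precision = a_p / len(predicted)
    recall = a_s / len(sure)
    aer = 1 - (a_s + a_p) / (len(predicted) + len(sure))
    return precision, recall, aer


# Made-up example: two sure links, one extra possible link, three predictions.
sure = {(0, 0), (1, 1)}
possible = {(2, 1)}
predicted = {(0, 0), (2, 1), (3, 3)}

precision, recall, aer = alignment_scores(predicted, sure, possible)
print(round(precision, 3), recall, round(aer, 3))  # 0.667 0.5 0.4
```

Lower AER is better; a prediction that hits only possible links is not penalised on precision but still fails to earn recall.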

Learning by reading

Word Locks and Personalised Bibles

How it is built

Word Locks are keyed to source-language identity, normally a Strong's number plus the aligned source token. In Composite mode, a reader can choose a preferred rendering for a word, and the app applies it wherever the alignment is suitable. Verse Locks and Word Locks share the same personalisation model. By default, Word Locks still apply inside a verse-locked verse; Profile → Account → Lock appearance (https://app.openscripture.io/profile?tab=account) lets readers choose verse-first wording instead.

Technical complications

This only works if the alignment is compact enough for substitution. If one source token accidentally owns a whole English clause, a Word Lock would damage the sentence. The pipeline therefore uses a substitution test: replacing one direct anchor should leave the surrounding English intelligible. Connector words such as articles and prepositions stay out of the lock path so locking the meaningful word does not quietly swallow its neighbours. Publisher policies also matter, so the write path has to respect translation-specific restrictions rather than relying on the button being hidden.
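A rough sketch of the lock path described above: only compact, direct anchors are replaced, and connector words are never touched. The `Token` shape, the connector list, and the sample verse tokens are illustrative assumptions rather than the app's real data model.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative connector list; the real exclusion rules are not documented here.
CONNECTORS = {"the", "a", "an", "of", "in", "to", "and", "by"}


@dataclass
class Token:
    text: str
    strongs: Optional[str] = None  # None for supplied words with no source token
    direct: bool = False           # True only for compact, one-word anchors


def apply_word_locks(tokens, locks):
    """Swap in the reader's preferred rendering for lockable tokens only."""
    out = []
    for tok in tokens:
        lockable = (
            tok.direct
            and tok.strongs in locks
            and tok.text.lower() not in CONNECTORS
        )
        out.append(locks[tok.strongs] if lockable else tok.text)
    return " ".join(out)


# Made-up token row for a short English phrase.
verse = [
    Token("by"),
    Token("grace", "G5485", direct=True),
    Token("you"),
    Token("have"),
    Token("been"),
    Token("saved", "G4982", direct=True),
]

print(apply_word_locks(verse, {"G5485": "favor"}))  # by favor you have been saved
```

Because a token is replaced only when it is a direct anchor, a source word that spans a whole English clause would simply be left alone rather than overwritten.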

Desired result

A reader can build vocabulary in context. Instead of studying a word once in a separate lexicon, they can see that word reappear across Scripture with their chosen rendering, while the rest of the verse remains connected to published translation text.

Why translations disagree

Translation Difference Symbols

How it is built

OpenScripture stores precomputed divergence data by verse. Entries are classified by what kind of difference is present: source-text or canon split, theological or interpretive rendering, or translation philosophy. The reader sees circled markers in context, and the drawer explains the difference with the relevant renderings grouped by tradition.

Technical complications

The hard part is judgment. A visible wording difference is not automatically a meaningful disagreement. The pipeline has to avoid inflating ordinary style differences into manuscript issues, avoid hiding important textual variants, keep explanations short, and store only phrase-level renderings so copyright boundaries stay respected.

Desired result

Readers get a small signal exactly where it helps: this verse is translated differently, and here is why. The goal is not to push a preferred wording, but to help people notice the scholarly landscape behind familiar English phrases.

Manuscript context without overload

Textual Certainty Signals

How it is built

Textual certainty data is stored sparsely at word or passage level. The pipeline can draw candidates from documented reference variants, translation editorial brackets, and SBLGNT/MorphGNT signals, then attach scores and reasons to morphology word positions. The reader setting decides how strongly those signals appear.
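Sparse storage plus a reader-controlled threshold might look like the sketch below. The score scale, the setting names, the threshold values, and the sample entry are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CertaintySignal:
    score: float  # assumed scale: 0.0 = heavily contested, 1.0 = stable
    reason: str


# Sparse store: only word positions with a documented signal get an entry.
# The entry below is illustrative, not real pipeline data.
signals = {
    ("Mark 16:9", 0): CertaintySignal(0.3, "editorial brackets in translations"),
}

# Hypothetical reader settings mapped to visibility thresholds.
THRESHOLDS = {"off": 0.0, "subtle": 0.4, "detailed": 0.8}


def visible_signal(verse, position, setting):
    """Return the signal to display, or None when the reading should stay quiet."""
    sig = signals.get((verse, position))
    if sig is None:
        return None  # no entry means nothing to show
    return sig if sig.score < THRESHOLDS[setting] else None


print(visible_signal("Mark 16:9", 0, "subtle").reason)  # editorial brackets in translations
print(visible_signal("Mark 16:9", 0, "off"))            # None
```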

Technical complications

Textual certainty is adjacent to translation disagreement, but it is not the same thing. A translation can differ because of style even when the source text is stable, or because the underlying manuscript reading is genuinely contested. The app keeps those signals separate. It also has to respect licensing limits around critical apparatus material, storing only what OpenScripture is allowed to store.

Desired result

Stable readings stay quiet. More debated readings can be marked when the reader wants that level of detail. The result is a Bible reader that can surface manuscript uncertainty without turning every chapter into a specialist apparatus.

Experimental, labelled, and source-aware

AI Translation Mode

How it is built

The AI Translation mode produces multiple style and emphasis combinations, using source-language morphology and permitted source material rather than simply paraphrasing a copyrighted English translation. Generated text carries confidence and decision metadata, and AI alignment is expected to follow the same word-data contract where alignment payloads exist. Dynamic/Readability is free (see /pricing); Premium unlocks all nine combos.

Technical complications

AI output is only useful if it is labelled honestly and kept inside a disciplined data model. The pipeline needs provenance, regeneration triggers, source-word context, and clear separation from publisher-authored Bible text. It also needs to avoid treating a fluent model sentence as automatically aligned or authoritative.

Desired result

Readers can explore how a passage might be rendered under different translation goals while still seeing where published translations, source-language data, and divergence signals provide firmer ground.

The quiet infrastructure

Translation Ingestion, Notes, and Canons

How it is built

Each new translation has to pass through licensing, metadata, verse text ingestion, publisher notes, copyright notices, reader visibility, search, comparison, and word-data checks. Where a publisher provides notes, introductions, commentary, or cross-references, those sources are normalized so the drawer can show the right material for the verse the reader tapped.

Technical complications

Publishers deliver data in different formats. Word documents, USFM, JSON, public-domain files, study notes, cross-reference lists, and commentary all behave differently. Versification can differ. Canon scope can differ. Formatting such as italics, bold, paragraphing, quotation layout, and note anchors is part of the meaning, so the parser cannot simply flatten everything into plain text.

Desired result

The desired result is a broad, respectful reader across Protestant, Catholic, Orthodox, Ecumenical, Jewish, and Independent traditions, with each translation shown on its own terms and connected to the same study surfaces where the data allows.

The Pipeline Pattern

The same discipline shows up across translation ingestion, word alignment, divergence explanations, and certainty data. OpenScripture tries to keep the original source, the generated candidate, the review status, and the reader-facing claim separate until the evidence is strong enough.

  1. Select the alignment lane: authoritative scholar tagging, publisher-supplied alignment, or computer-assisted, whichever is most reliable for this translation.

  2. Ingest the publisher text and notes from the most authoritative available source.

  3. Normalize books, chapters, verses, notes, formatting, copyright terms, and reader visibility.

  4. Attach morphology, Strong's data, source tokens, and translation-specific English token positions.

  5. Generate candidate alignments and divergence explanations, then score against verified examples using Alignment Error Rate.

  6. Promote reviewed data into the runtime tables that power the reader, drawer, interlinear view, and lock system.

  7. Keep uncertainty visible: approximate alignment stays approximate, reviewed data earns stronger language, and gaps remain labelled rather than hidden.
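The last two steps amount to keeping the review status attached to each row and deriving the reader-facing wording from it. A minimal sketch, with hypothetical type names and label strings:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ReviewStatus(Enum):
    GENERATED = "generated"  # produced by the aligner, not yet reviewed
    REVIEWED = "reviewed"    # checked and promoted to runtime data


@dataclass
class AlignmentRow:
    source_token: str
    candidate_span: Optional[str]  # None when no English span was linked
    status: ReviewStatus


def reader_label(row: Optional[AlignmentRow]) -> str:
    """Map a row's status to the claim the reader sees (labels are illustrative)."""
    if row is None or row.candidate_span is None:
        return "no word link yet"
    if row.status is ReviewStatus.REVIEWED:
        return "reviewed"
    return "approximate"


print(reader_label(AlignmentRow("H7225", "beginning", ReviewStatus.GENERATED)))  # approximate
print(reader_label(None))                                                        # no word link yet
```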

Research direction — not yet fully shipped

Where the Alignment Method Is Heading

The computer-assisted alignment lane is being extended around a principle from modern natural language processing research: combine several independent lines of evidence and trust a link when they agree. A link corroborated by lexical glosses, learned translation probabilities, part-of-speech and lemma matching, position, and semantic similarity is more trustworthy than one supported by a single signal.

The evidence sources the pipeline is designed to fuse include: lexical glosses, translation-probability models trained on confirmed alignments, part-of-speech and lemma matching, agreement across multiple translations sharing the same Strong's spine, local semantic embeddings, statistical aligners, syntactic structure, and orthographic similarity for transliterated name variants. Agreement across independent sources earns higher confidence; disagreement flags a case for human review. Score fusion of this kind draws on the same principle as Reciprocal Rank Fusion in the information retrieval literature.
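Reciprocal Rank Fusion itself is compact: each evidence source ranks the candidates, and a candidate's fused score is the sum of 1 / (k + rank) over all sources. The sketch below uses the commonly cited default k = 60 and made-up rankings; it illustrates the principle and is not a description of the pipeline's actual fusion step.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists; items ranked high by many lists come first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Made-up rankings of candidate English spans from three evidence sources.
by_gloss = ["beginning", "first", "In"]
by_translation_prob = ["beginning", "In", "first"]
by_position = ["In", "beginning", "first"]

fused = reciprocal_rank_fusion([by_gloss, by_translation_prob, by_position])
print(fused)  # ['beginning', 'In', 'first']
```

Rank-based fusion avoids having to calibrate raw scores from very different signals against each other, which is one reason it suits evidence of mixed kinds.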

Literal translations align almost word-for-word and move through this system cleanly. Dynamic or paraphrase translations reorder, expand, and supply more — so their alignment is partial by nature, and the system is designed to say so rather than fabricate coverage. Where a layer is not yet finished, the app shows what data exists rather than guessing. See current status for live word-link and AI-generation progress.

A Contribution to Digital Bible Accessibility

The broader movement is not only about putting more Bible text online. It is about making the depth behind the text easier to reach: original languages, translation philosophy, textual history, commentary, cross-references, and personal study patterns.

OpenScripture contributes by building a reader where those layers can be available without overwhelming the page. The work is slow because the details matter, but the payoff is a Bible experience that can become more open, more transparent, and more useful with each data layer added.

  • Make deep study tools understandable for ordinary readers, not only specialists.
  • Let many translation traditions sit beside each other without flattening their differences.
  • Expose the data limits honestly so digital convenience does not become false certainty.
  • Use modern software, careful licensing, and reviewable pipelines to make Scripture study more accessible over time.

Sources and Methods

OpenScripture is built on open scholarship. The alignment pipeline re-implements ideas from the computational linguistics research below; it does not vendor these tools directly. License terms are listed as published by each project — verify at the linked source before relying on them.

Data Sources and Texts

STEPBible TAGNT and TAHOT

Translators Amalgamated Greek NT and Translators Amalgamated Hebrew OT, including KJV word-level alignment and Strong's links.

OpenScriptures Hebrew Bible (OSHB / morphhb)

CC BY 4.0

Westminster Leningrad Codex (the Masoretic Text) with full morphological tagging for every Hebrew and Aramaic word.

SBL Greek New Testament (SBLGNT)

Edited by Michael W. Holmes (Society of Biblical Literature & Logos Bible Software). Paired with MorphGNT morphology.

Berean Standard Bible interlinear

Word-level interlinear alignment published alongside the Berean Bible translation.

unfoldingWord Literal Text

Hand-built word-level alignment to the original-language texts, published by unfoldingWord.

NET Bible

Translation text and word-alignment data for the NET Bible.

Syntactic trees and additional linguistic annotation for biblical Hebrew and Koine Greek.

Brenton's English Septuagint

Public domain

English translation of the Greek Old Testament (Septuagint / LXX), used as base for LXX-aligned translations.

Clementine Vulgate

Public domain

Latin Vulgate base text, used as the original-language layer for Vulgate-descended translations such as the Douay-Rheims.

Strong's Exhaustive Concordance (James Strong, 1890)

Public domain

The original lexical numbering system for biblical Hebrew and Greek — still the most widely used cross-translation linking scheme in Bible software.

Methods and Research

The alignment engine adapts ideas from the following computational linguistics literature. These are the intellectual lineage of the method; OpenScripture re-implements the core concepts for the biblical alignment domain rather than vendoring any of these tools.

  1. Brown, P. F., et al. (1993). The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2).

    IBM Models 1–5: the foundational statistical word alignment framework underpinning most aligners.

  2. Vogel, S., Ney, H., & Tillmann, C. (1996). HMM-based word alignment in statistical translation. COLING.

    HMM alignment model: each alignment position depends on the previous one (relative jump width), capturing word-order differences between source and target.

  3. Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1).

    Defined Alignment Error Rate (AER): precision and recall across sure and possible links.

  4. Melamed, I. D. (2000). Models of translational equivalence among words. Computational Linguistics, 26(2).

    Competitive linking: resolving many-to-one and one-to-many alignment conflicts.

  5. Moore, R. C. (2004). Improving IBM word alignment Model 1. ACL.

    Practical improvements to translation probability estimation for small or sparse parallel corpora.

  6. Liang, P., Taskar, B., & Klein, D. (2006). Alignment by agreement. NAACL.

    Training source→target and target→source models to agree; a key inspiration for corroborated-evidence alignment.

  7. Koehn, P., et al. (2007). Moses: Open source toolkit for statistical machine translation. ACL.

    Reference SMT implementation including symmetrisation heuristics (grow-diag-final) used across the field.

  8. Dyer, C., Chahuneau, V., & Smith, N. A. (2013). A simple, fast, and effective reparameterization of IBM Model 2. NAACL.

    fast_align: efficient alignment with a strong diagonal prior, widely used for new language pairs.

  11. Dou, Z.-Y., & Neubig, G. (2021). Word alignment by fine-tuning embeddings on parallel corpora. EACL.

    awesome-align: fine-tuned contextual embeddings achieving state-of-the-art alignment on standard benchmarks.

  12. Imani, A., et al. (2021). Graph algorithms for multiparallel word alignment. EMNLP.

    Multi-parallel consensus alignment: leveraging agreement across multiple translation pairs sharing a common source.

  14. Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8).

    Edit distance: orthographic similarity for transliterated name variants (Nebuchadnezzar / Nabuchodonosor).

  15. Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O'Reilly.

    NLTK: reference NLP toolkit for tokenization and lemmatization.

Built for Readers, Designed for Evidence

The app should feel simple, but the simplicity is earned by the pipeline beneath it: careful ingestion, honest alignment, reviewable automation, and visible uncertainty where certainty would be dishonest.