This project began as a personal learning tool for studying the Greek New Testament and Hebrew Old Testament in their original languages. It combines an interest in data and structured workflows with a desire to read the Scriptures more carefully in the original languages.
The raw data was extracted from interlinear Bible sources, producing per-word records with morphological parsing, Strong's concordance numbers, and English glosses. Each corpus was processed through an automated pipeline. The Greek New Testament records use the following columns:
| Column | Description | Example |
|---|---|---|
| greek_word | The Greek word form as it appears in the text | λόγος |
| english_translation | English gloss / interlinear translation | word |
| parsing_abbreviation | Morphological parsing code | N-NMS |
| strongs_number | Strong's Greek concordance number | G3056 |
| chapter_ref | Book and chapter reference | John 1 |
| ordinal | Sequential position number | 1 |
The Hebrew Old Testament records use the following columns:

| Column | Description | Example |
|---|---|---|
| hebrew_word | The Hebrew word form with cantillation | בראשִׁית |
| lemma | Strong's number (lexical form reference) | H7225 |
| transliteration | Romanized pronunciation | bə·rē·šîṯ |
| english_translation | English gloss | In the beginning |
| parsing_abbreviation | Morphological parsing code | N-fs |
| strongs_number | Strong's Hebrew concordance number | H7225 |
| book | Book name | Genesis |
| chapter | Chapter number | 1 |
| ordinal | Sequential position number | 1 |
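A per-word record can be modeled directly from the table above. The following Python sketch is illustrative only: the class name, field types, and sample values (taken from the example column) are assumptions, not the project's actual data model.

```python
from typing import NamedTuple

class HebrewWord(NamedTuple):
    """One per-word record; field names follow the Hebrew table above."""
    hebrew_word: str           # word form with pointing
    lemma: str                 # Strong's number (lexical form reference)
    transliteration: str       # romanized pronunciation
    english_translation: str   # English gloss
    parsing_abbreviation: str  # morphological parsing code
    strongs_number: str        # Strong's Hebrew concordance number
    book: str
    chapter: int
    ordinal: int               # sequential position number

# First word of Genesis 1, using the example values from the table.
first = HebrewWord("בראשִׁית", "H7225", "bə·rē·šîṯ", "In the beginning",
                   "N-fs", "H7225", "Genesis", 1, 1)
```

Keeping records as an immutable tuple type makes downstream steps (grouping by book, counting lemmas) straightforward.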
Both strategies use a greedy algorithm and share the same starting book: the one whose vocabulary set overlaps most with the most frequent words across the entire corpus. From there they diverge:

- **Fewest new words:** at each step, pick the remaining book that introduces the fewest new lemmas. This keeps vocabulary growth as gradual as possible.
- **Highest coverage:** at each step, pick the remaining book where the highest percentage of its vocabulary is already known. This maximizes reading comprehension at each step.
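The two selection rules can be sketched as a single greedy loop. This is an illustrative Python sketch, not the project's actual implementation: the input format (book name mapped to its lemma set), the strategy names, and the choice of the top 100 most frequent lemmas for seeding the start are all assumptions.

```python
from collections import Counter

def greedy_order(book_vocab, strategy="fewest_new"):
    """Order books greedily by vocabulary overlap.

    book_vocab: dict mapping book name -> set of lemmas (assumed format).
    strategy: "fewest_new" minimizes new lemmas per step;
              "max_known" maximizes the fraction of already-known lemmas.
    """
    # Approximate "most frequent words" by the lemmas appearing in the
    # most books (a simplifying assumption; token counts would be better).
    freq = Counter(lemma for vocab in book_vocab.values() for lemma in vocab)
    top = {lemma for lemma, _ in freq.most_common(100)}

    # Shared starting book: highest overlap with the frequent-lemma set.
    remaining = dict(book_vocab)
    first = max(remaining, key=lambda b: len(remaining[b] & top))
    order, known = [first], set(remaining.pop(first))

    while remaining:
        if strategy == "fewest_new":
            # Fewest new words: minimize lemmas not yet seen.
            nxt = min(remaining, key=lambda b: len(remaining[b] - known))
        else:
            # Highest coverage: maximize the known fraction of the book.
            nxt = max(remaining,
                      key=lambda b: len(remaining[b] & known) / len(remaining[b]))
        order.append(nxt)
        known |= remaining.pop(nxt)
    return order
```

Both variants run in O(n²) set operations over the books, which is trivial for a corpus of 27 or 39 books.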
The Greek NT reading order was originally inspired by *Greek for Life* by Jonathan T. Pennington, which recommends reading the NT in a specific order to build vocabulary naturally.
The computational approach extends this idea by using actual corpus data to optimize the sequence algorithmically.
This project is licensed under CC BY 4.0. You are free to use, share, and adapt the data with attribution.
Source code: GitHub