Datasets / Models / Documentation

Public datasets and model releases behind the work.

This page is a clean map of the public artifacts that matter most: speech datasets, large Amharic corpora, Shook ASR checkpoints, and the Gemma continued-pretraining (CPT) and supervised fine-tuning (SFT) line. It is meant to make the technical signal legible without forcing people to dig through profiles and activity feeds.

wxl_amh

Speech / 3k rows / Waxal Amharic audio transcriptions

The speech dataset that should lead the page. It makes the Amharic voice-data work legible immediately and ties directly to the Waxal gap you called out.

  • 3k rows
  • 988 audio files in repo tree
  • Speech + transcription

Documentation

  • Best public proof point for the voice-data side of the work.
  • Useful for ASR training, transcription workflows, and speech evaluation in Amharic; see the loading sketch after this list.
  • Pairs naturally with the Shook speech models on the same page.
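
A minimal loading sketch, assuming a standard datasets audio layout. The repo id (the "<org>/wxl_amh" path) and the audio/transcription column names are assumptions; check them against the dataset card on the Hub.

```python
# Load the speech dataset and decode audio for ASR experiments.
from datasets import load_dataset, Audio

# Hypothetical repo id; substitute the real "<org>/wxl_amh" path.
ds = load_dataset("<org>/wxl_amh", split="train")

# Whisper-family models expect 16 kHz input.
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

sample = ds[0]
print(sample["audio"]["array"].shape)  # raw waveform as a NumPy array
print(sample.get("transcription"))     # column name is an assumption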

amharic-combined-corpus

Text / 10.1M rows / combined Amharic corpus

The strongest text-scale artifact to highlight. It gives the page real substance on the corpus side and supports pretraining, adaptation, and downstream evaluation.

  • 10.1M rows
  • 2.29 GB
  • 50 downloads last month

Documentation

  • The Hugging Face viewer reports 10,056,352 rows in the train split.
  • Most relevant as the text-layer proof behind Amharic model adaptation work; a streaming-access sketch follows this list.
  • Belongs next to the Gemma CPT line because the story is corpus first, model second.
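
A minimal access sketch, assuming the corpus loads through the datasets library; streaming avoids pulling the full 2.29 GB before inspecting rows. The repo id and the text column name are assumptions.

```python
# Stream the corpus instead of downloading all 2.29 GB up front.
from datasets import load_dataset

# Hypothetical repo id; substitute the real "<org>/amharic-combined-corpus" path.
ds = load_dataset("<org>/amharic-combined-corpus", split="train", streaming=True)

# Peek at the first few rows without materializing the dataset.
for i, row in enumerate(ds):
    print(row["text"][:120])  # "text" column name is an assumption
    if i == 4:
        break
```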

wikipedia-amharic

Bilingual text / 55.8k rows / English-Amharic Wikipedia

A clean bilingual knowledge asset with real documentation behind it. It is one of the easiest releases on the page for investors, journalists, and researchers to understand quickly.

  • 55.8k rows
  • Apache-2.0
  • English + Amharic

Documentation

  • Translated by Addis AI - Aleph and documented as a parallel English-Amharic corpus.
  • Useful for translation, retrieval, QA, and knowledge-grounded language work; see the pairing sketch after this list.
  • Good supporting proof because it is both technically relevant and easy to explain.
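
A minimal sketch of pulling parallel pairs for translation or retrieval work. The repo id and the column names are assumptions to verify against the card.

```python
# Inspect English-Amharic pairs for translation/retrieval experiments.
from datasets import load_dataset

ds = load_dataset("<org>/wikipedia-amharic", split="train")  # hypothetical repo id

for row in ds.select(range(3)):
    print("EN:", row["english"][:100])  # column name is an assumption
    print("AM:", row["amharic"][:100])  # column name is an assumption
```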

shook-medium-amharic-2k

ASR / 0.8B params / WER 9.1091

The clearest public speech-model proof point on the page. It publishes evaluation numbers, download activity, and training hyperparameters, which makes it the strongest Shook release to lead with.

  • 420 downloads last month
  • 0.8B params
  • WER 9.1091
  • LR 1e-5 / batch 16 / grad acc 4 / 2 epochs

Documentation

  • Current Hugging Face card includes a full training-results table and public hyperparameters.
  • Best Shook model to lead with because it is measurable, not just described.
  • Strongest public link between the speech data layer and the deployed ASR story; an inference sketch follows.
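
A minimal inference sketch, assuming the checkpoint is a Whisper-style seq2seq ASR model (the companion shook-tiny card says it was fine-tuned from whisper-tiny, and 0.8B params sits at the whisper-medium scale). The repo id is an assumption.

```python
# Transcribe Amharic audio with the transformers ASR pipeline.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="<org>/shook-medium-amharic-2k",  # hypothetical repo id
    chunk_length_s=30,  # process long-form audio in 30 s windows
)

result = asr("amharic_sample.wav")  # any local audio file path
print(result["text"])
```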

shook-tiny-amharic-600hr

ASR / 37.8M params / WER 22.1786

The smaller, lighter end of the Shook speech stack. It matters because it shows the work extends beyond a single flagship checkpoint.

  • 37.8M params
  • Fine-tuned from whisper-tiny
  • WER 22.1786

Documentation

  • Training card exposes hyperparameters, including LR 3e-5, batch size 96, and 2 epochs.
  • Good supporting proof for compact speech deployment work; a WER-scoring sketch follows this list.
  • Most useful when shown as the smaller companion to shook-medium-amharic-2k.
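
A minimal scoring sketch showing how WER numbers like the ones quoted above are typically computed, using the evaluate library's standard WER metric. The quoted figures read as percentages, so a card value of 22.1786 corresponds to a fraction of about 0.22.

```python
# Compute word error rate over reference/hypothesis transcript pairs.
import evaluate

wer = evaluate.load("wer")

references = ["ሰላም ለዓለም"]   # ground-truth transcript (toy example)
predictions = ["ሰላም ለአለም"]  # model output for the same audio (toy example)

# evaluate returns WER as a fraction; multiply by 100 to match the
# percentage-style numbers quoted on the model cards.
score = wer.compute(references=references, predictions=predictions)
print(f"WER: {100 * score:.4f}")
```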

gemma-2-27b-amharic-alpaca-sft

Amharic SFT track / 27B-class model

The supervised fine-tuning layer in the Gemma Amharic line. It is the clearest public signal that the model work extends beyond speech into instruction-following language systems.

  • 27B class
  • SFT track
  • Instruction-following

Documentation

  • Model card states that it is fine-tuned on top of gemma-2-27b-amharic-cpt.
  • Supports conversational interactions in Amharic; see the generation sketch after this list.
  • Best framed as the top of the stack after corpus work and continued pretraining.
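
A minimal generation sketch for the instruction-tuned model, assuming the checkpoint ships a chat template and that you have the multi-GPU or quantized setup a 27B-class model needs in practice. The repo id and the example prompt are illustrative only.

```python
# Instruction-following generation in Amharic with a Gemma 2 SFT model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<org>/gemma-2-27b-amharic-alpaca-sft"  # hypothetical repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shards the 27B model across available GPUs
)

# "Tell me briefly about the history of Ethiopia."
messages = [{"role": "user", "content": "ስለ ኢትዮጵያ ታሪክ ባጭሩ ንገረኝ።"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```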

gemma-2-27b-amharic-cpt

Amharic CPT track / 27B-class model

The continued pretraining track in the Gemma Amharic line. It is the right model to highlight when you want to show foundation-model adaptation rather than only application-layer fine-tuning.

  • 27B class
  • ~2B tokens of Amharic
  • Expanded tokenizer

Documentation

  • Model card says the tokenizer was expanded from the original 256k vocabulary to better cover Amharic script.
  • Continual pretraining was done on roughly 2B tokens of Amharic corpus data.
  • This is the cleanest public proof that the stack reaches the foundation-model adaptation layer; a tokenizer-comparison sketch follows.
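
A minimal comparison sketch showing what the expanded tokenizer buys: fewer tokens per Amharic sentence than the base Gemma 2 vocabulary, which translates directly into cheaper training and inference. The CPT repo id is an assumption; google/gemma-2-27b is the published base.

```python
# Compare Amharic tokenization between the base and expanded tokenizers.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("google/gemma-2-27b")
adapted = AutoTokenizer.from_pretrained("<org>/gemma-2-27b-amharic-cpt")  # hypothetical

# "Addis Ababa is the capital of Ethiopia."
sentence = "አዲስ አበባ የኢትዮጵያ ዋና ከተማ ናት።"

print("base tokens:    ", len(base.tokenize(sentence)))
print("expanded tokens:", len(adapted.tokenize(sentence)))  # expect fewer
```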

amharic_courpus_tg_1

Text / 477k rows / Amharic corpus release

A useful mid-scale corpus release that broadens the public Amharic text layer beyond the flagship combined corpus.

the-stack-amharic

Code + translation / Amharic developer dataset

A more niche release, but a strategically important one because it points toward developer tooling and code-oriented Amharic infrastructure.

shook-900-base

Base model / 72.6M params

A base Shook checkpoint that helps make the model family visible beyond the tuned Amharic releases.