Datasets / Models / Documentation

Public datasets and model releases behind the work.

This page is a clean map of the public artifacts that matter most: speech datasets, large Amharic corpora, Shook ASR checkpoints, and the Gemma continued-pretraining (CPT) and supervised fine-tuning (SFT) line. It is meant to make the technical signal legible without forcing people to dig through profiles and activity feeds.

wxl_amh

Speech / 3k rows / Waxal Amharic audio transcriptions

The speech dataset that should lead the page. It makes the Amharic voice-data work legible immediately and ties directly to the Waxal gap you called out.

  • 3k rows
  • 988 audio files in repo tree
  • Speech + transcription

Documentation

  • Best public proof point for the voice-data side of the work.
  • Useful for ASR training, transcription workflows, and speech evaluation in Amharic; see the loading sketch after this list.
  • Pairs naturally with the Shook speech models on the same page.
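
A minimal loading sketch, assuming a standard datasets audio layout. The repo id (the "<org>/wxl_amh" path) and the audio/transcription column names are assumptions; check them against the dataset card on the Hub.

```python
# Load the speech dataset and decode audio for ASR experiments.
from datasets import load_dataset, Audio

# Hypothetical repo id; substitute the real "<org>/wxl_amh" path.
ds = load_dataset("<org>/wxl_amh", split="train")

# Whisper-family models expect 16 kHz input.
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

sample = ds[0]
print(sample["audio"]["array"].shape)  # raw waveform as a NumPy array
print(sample.get("transcription"))     # column name is an assumption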

amharic-combined-corpus

Text / 10.1M rows / combined Amharic corpus

The strongest text-scale artifact to highlight. It gives the page real substance on the corpus side and supports pretraining, adaptation, and downstream evaluation.

  • 10.1M rows
  • 2.29 GB
  • 50 downloads last month

Documentation

  • The Hugging Face viewer reports 10,056,352 rows in the train split.
  • Most relevant as the text-layer proof behind Amharic model adaptation work; a streaming-access sketch follows this list.
  • Belongs next to the Gemma CPT line because the story is corpus first, model second.
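
A minimal access sketch, assuming the corpus loads through the datasets library; streaming avoids pulling the full 2.29 GB before inspecting rows. The repo id and the text column name are assumptions.

```python
# Stream the corpus instead of downloading all 2.29 GB up front.
from datasets import load_dataset

# Hypothetical repo id; substitute the real "<org>/amharic-combined-corpus" path.
ds = load_dataset("<org>/amharic-combined-corpus", split="train", streaming=True)

# Peek at the first few rows without materializing the dataset.
for i, row in enumerate(ds):
    print(row["text"][:120])  # "text" column name is an assumption
    if i == 4:
        break
```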

wikipedia-amharic

Bilingual text / 55.8k rows / English-Amharic Wikipedia

A clean bilingual knowledge asset with real documentation behind it. It is one of the easiest releases on the page for investors, journalists, and researchers to understand quickly.

  • 55.8k rows
  • Apache-2.0
  • English + Amharic

Documentation

  • Translated by Addis AI - Aleph and documented as a parallel English-Amharic corpus.
  • Useful for translation, retrieval, QA, and knowledge-grounded language work; see the pairing sketch after this list.
  • Good supporting proof because it is both technically relevant and easy to explain.
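
A minimal sketch of pulling parallel pairs for translation or retrieval work. The repo id and the column names are assumptions to verify against the card.

```python
# Inspect English-Amharic pairs for translation/retrieval experiments.
from datasets import load_dataset

ds = load_dataset("<org>/wikipedia-amharic", split="train")  # hypothetical repo id

for row in ds.select(range(3)):
    print("EN:", row["english"][:100])  # column name is an assumption
    print("AM:", row["amharic"][:100])  # column name is an assumption
```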

shook-medium-amharic-2k

ASR / 0.8B params / WER 9.1091

The clearest public speech-model proof point on the page. It publishes evaluation numbers, download activity, and training hyperparameters, which makes it the strongest Shook release to lead with.

  • 420 downloads last month
  • 0.8B params
  • WER 9.1091
  • LR 1e-5 / batch 16 / grad acc 4 / 2 epochs

Documentation

  • Current Hugging Face card includes a full training-results table and public hyperparameters.
  • Best Shook model to lead with because it is measurable, not just described.
  • Strongest public link between the speech data layer and the deployed ASR story; an inference sketch follows.
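
A minimal inference sketch, assuming the checkpoint is a Whisper-style seq2seq ASR model (the companion shook-tiny card says it was fine-tuned from whisper-tiny, and 0.8B params sits at the whisper-medium scale). The repo id is an assumption.

```python
# Transcribe Amharic audio with the transformers ASR pipeline.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="<org>/shook-medium-amharic-2k",  # hypothetical repo id
    chunk_length_s=30,  # process long-form audio in 30 s windows
)

result = asr("amharic_sample.wav")  # any local audio file path
print(result["text"])
```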

shook-tiny-amharic-600hr

ASR / 37.8M params / WER 22.1786

The smaller, lighter end of the Shook speech stack. It matters because it shows the work extends beyond a single flagship checkpoint.

  • 37.8M params
  • Fine-tuned from whisper-tiny
  • WER 22.1786

Documentation

  • Training card exposes hyperparameters, including LR 3e-5, batch size 96, and 2 epochs.
  • Good supporting proof for compact speech deployment work; a WER-scoring sketch follows this list.
  • Most useful when shown as the smaller companion to shook-medium-amharic-2k.
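
A minimal scoring sketch showing how WER numbers like the ones quoted above are typically computed, using the evaluate library's standard WER metric. The quoted figures read as percentages, so a card value of 22.1786 corresponds to a fraction of about 0.22.

```python
# Compute word error rate over reference/hypothesis transcript pairs.
import evaluate

wer = evaluate.load("wer")

references = ["ሰላም ለዓለም"]   # ground-truth transcript (toy example)
predictions = ["ሰላም ለአለም"]  # model output for the same audio (toy example)

# evaluate returns WER as a fraction; multiply by 100 to match the
# percentage-style numbers quoted on the model cards.
score = wer.compute(references=references, predictions=predictions)
print(f"WER: {100 * score:.4f}")
```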

gemma-2-27b-amharic-alpaca-sft

Amharic SFT track / 27B-class model

The supervised fine-tuning layer in the Gemma Amharic line. It is the clearest public signal that the model work extends beyond speech into instruction-following language systems.

  • 27B class
  • SFT track
  • Instruction-following

Documentation

  • Model card states that it is fine-tuned on top of gemma-2-27b-amharic-cpt.
  • Supports conversational interactions in Amharic; see the generation sketch after this list.
  • Best framed as the top of the stack after corpus work and continued pretraining.
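
A minimal generation sketch for the instruction-tuned model, assuming the checkpoint ships a chat template and that you have the multi-GPU or quantized setup a 27B-class model needs in practice. The repo id and the example prompt are illustrative only.

```python
# Instruction-following generation in Amharic with a Gemma 2 SFT model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<org>/gemma-2-27b-amharic-alpaca-sft"  # hypothetical repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shards the 27B model across available GPUs
)

# "Tell me briefly about the history of Ethiopia."
messages = [{"role": "user", "content": "ስለ ኢትዮጵያ ታሪክ ባጭሩ ንገረኝ።"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```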

gemma-2-27b-amharic-cpt

Amharic CPT track / 27B-class model

The continued pretraining track in the Gemma Amharic line. It is the right model to highlight when you want to show foundation-model adaptation rather than only application-layer fine-tuning.

  • 27B class
  • ~2B tokens of Amharic
  • Expanded tokenizer

Documentation

  • Model card says the tokenizer was expanded from the original 256k vocabulary to better cover Amharic script.
  • Continual pretraining was done on roughly 2B tokens of Amharic corpus data.
  • This is the cleanest public proof that the stack reaches the foundation-model adaptation layer; a tokenizer-comparison sketch follows.
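
A minimal comparison sketch showing what the expanded tokenizer buys: fewer tokens per Amharic sentence than the base Gemma 2 vocabulary, which translates directly into cheaper training and inference. The CPT repo id is an assumption; google/gemma-2-27b is the published base.

```python
# Compare Amharic tokenization between the base and expanded tokenizers.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("google/gemma-2-27b")
adapted = AutoTokenizer.from_pretrained("<org>/gemma-2-27b-amharic-cpt")  # hypothetical

# "Addis Ababa is the capital of Ethiopia."
sentence = "አዲስ አበባ የኢትዮጵያ ዋና ከተማ ናት።"

print("base tokens:    ", len(base.tokenize(sentence)))
print("expanded tokens:", len(adapted.tokenize(sentence)))  # expect fewer
```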

amharic_courpus_tg_1

Text / 477k rows / Amharic corpus release

A useful mid-scale corpus release that broadens the public Amharic text layer beyond the flagship combined corpus.

the-stack-amharic

Code + translation / Amharic developer dataset

A more niche release, but a strategically important one because it points toward developer tooling and code-oriented Amharic infrastructure.

shook-900-base

Base model / 72.6M params

A base Shook checkpoint that helps make the model family visible beyond the tuned Amharic releases.