Capstone: A CSV Summariser
Pulling everything together into one small, real program
Twelve lessons ago you printed hello, world. You now know enough to build a real, complete program — one that reads data, decides what it’s looking at, computes statistics, handles its own errors, and prints a clean report.
This lesson builds exactly that. Every concept you’ve met shows up in one small file.
A capstone isn’t new material. It’s the moment the separate pieces — ownership, references, structs, enums, traits, iterators, Result, the ? operator — stop being individual lessons and become a way of writing programs. We’ll build a CSV summariser: give it a table of comma-separated values, and it tells you, per column, whether that column holds integers, floating-point numbers, or text, plus the count / min / max / mean for the numeric ones.
It’s small enough to fit on this page and big enough to touch nearly every idea from the curriculum. Read it once for the shape, then run it, then change it.
A few words you’ll need
- CSV: “comma-separated values”, a plain-text format where each line is a row and commas separate the cells. Spreadsheets export it; mountains of real data arrive this way.
- Standard output (stdout): where println! goes by default, usually your terminal.
- Standard error (stderr): a separate output channel for error messages, by convention. eprintln! writes here.
- Exit code: a number a program hands the operating system when it finishes. 0 means success, anything else means failure; shells use it to chain commands.
What we’re building
Feed the program this table:
name,age,salary
Ada,36,125000.00
Grace,57,98500.50
Linus,42,48000.00
Dennis,21,42000.00
and it prints:
column   type   count        min        max       mean
name     text       4          -          -          -
age      int        4      21.00      57.00      39.00
salary   float      4   42000.00  125000.00   78375.12
The logic, in four steps:
- Read the CSV text.
- For each column, look at every value and decide: integer, floating-point, or text.
- For numeric columns, compute count / min / max / mean.
- Print a fixed-width table. Any failure (empty input, a ragged row) becomes a typed error, a one-line message on stderr, and a non-zero exit code.
One column, three guesses
Before the whole program, here’s its beating heart in isolation: a classify function that walks a column’s values and guesses its type. The rule is simple and greedy — if every value parses as an integer, it’s an int; else if every value parses as a float, it’s a float; otherwise it’s text.
Hit Run and watch three different columns get three different verdicts:
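A minimal sketch of that experiment, assuming the three columns of the sample table as input. The PartialEq derive on ColumnType is an addition made here so the verdicts can be compared directly; the lesson's own enum only derives Debug.

```rust
#[derive(Debug, PartialEq)]
enum ColumnType {
    Int,
    Float,
    Text,
}

/// Greedy inference: try the strictest type first, fall back to text.
fn classify(values: &[&str]) -> ColumnType {
    if values.iter().all(|v| v.parse::<i64>().is_ok()) {
        return ColumnType::Int;
    }
    if values.iter().all(|v| v.parse::<f64>().is_ok()) {
        return ColumnType::Float;
    }
    ColumnType::Text
}

fn main() {
    // Sample columns copied from the table above.
    let names = ["Ada", "Grace", "Linus", "Dennis"];
    let ages = ["36", "57", "42", "21"];
    let salaries = ["125000.00", "98500.50", "48000.00", "42000.00"];

    println!("names    -> {:?}", classify(&names)); // Text
    println!("ages     -> {:?}", classify(&ages)); // Int
    println!("salaries -> {:?}", classify(&salaries)); // Float
}
```

Try changing one age to "n/a": a single value that fails both parses is enough to turn the whole column into text.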
Notice how much of the curriculum is already here. &[&str] is a slice (lesson 6): the function borrows the values, it doesn’t own them. .iter() and .all() are iterator methods (lesson 9); .all() short-circuits the moment one value fails to parse. v.parse::<i64>() returns a Result (lesson 10), and .is_ok() collapses it to a plain bool. Three lessons in one short function.
The types
A handful of types do most of the work in the full program. Read them slowly — every line reaches back to an earlier lesson.
use std::fmt;

#[derive(Debug)]
enum ColumnType {
    Int,
    Float,
    Text,
}

impl fmt::Display for ColumnType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ColumnType::Int => write!(f, "int"),
            ColumnType::Float => write!(f, "float"),
            ColumnType::Text => write!(f, "text"),
        }
    }
}

#[derive(Debug, Default)]
struct ColumnStats {
    count: usize,
    min: Option<f64>,
    max: Option<f64>,
    sum: f64,
}

#[derive(Debug)]
struct ColumnReport {
    name: String,
    kind: ColumnType,
    stats: ColumnStats,
}

#[derive(Debug)]
enum AppError {
    Parse(String),
}
Things worth noticing before we run anything:
- Enums model both data and failure. ColumnType is one of three variants; AppError is the program’s one failure mode (a bad parse). Matching on either is exhaustive: add a variant later and the compiler points at every match that needs updating.
- #[derive(...)] is “compiler, write the boilerplate for me.” Default hands us ColumnStats::default(), a zero / None starting point, for free. Debug gives us {:?} printing for development.
- Display is implemented by hand for the types we show users. fmt is the trait method; the '_ is a lifetime placeholder you can read as “fill this in.”
- Option<f64> for min/max. They’re genuinely absent until the first numeric value arrives. Reaching for Option instead of a sentinel like f64::NAN is the Rust style: the type makes “nothing yet” honest.
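To see what the derives and the hand-written Display buy you, here is a small sketch that prints each type both ways. It repeats two of the definitions above so it runs on its own; the main function is only illustrative.

```rust
use std::fmt;

#[derive(Debug)]
enum ColumnType {
    Int,
    Float,
    Text,
}

impl fmt::Display for ColumnType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ColumnType::Int => write!(f, "int"),
            ColumnType::Float => write!(f, "float"),
            ColumnType::Text => write!(f, "text"),
        }
    }
}

#[derive(Debug, Default)]
struct ColumnStats {
    count: usize,
    min: Option<f64>,
    max: Option<f64>,
    sum: f64,
}

fn main() {
    // Default gives the "nothing seen yet" starting point.
    let stats = ColumnStats::default();
    println!("{stats:?}");
    println!("count={} sum={} min={:?} max={:?}", stats.count, stats.sum, stats.min, stats.max);

    // {} uses the hand-written Display, {:?} uses the derived Debug.
    for kind in [ColumnType::Int, ColumnType::Float, ColumnType::Text] {
        println!("{kind} / {kind:?}");
    }
}
```

The user-facing form is lowercase ("int") while the Debug form keeps the variant name ("Int"), which is why the report uses Display and development output uses Debug.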
Parsing the table
fn parse(input: &str) -> Result<(Vec<String>, Vec<Vec<String>>), AppError> {
    let mut lines = input.lines();
    let header = lines.next().ok_or_else(|| AppError::Parse("empty input".into()))?;
    let columns: Vec<String> = header.split(',').map(|s| s.trim().to_string()).collect();
    let mut rows: Vec<Vec<String>> = Vec::new();
    for (i, line) in lines.enumerate() {
        let cells: Vec<String> = line.split(',').map(|s| s.trim().to_string()).collect();
        if cells.len() != columns.len() {
            return Err(AppError::Parse(format!(
                "row {} has {} cells, expected {}",
                i + 2, cells.len(), columns.len()
            )));
        }
        rows.push(cells);
    }
    Ok((columns, rows))
}
We borrow input: &str rather than take ownership — the caller keeps its string. .lines() is a lazy iterator; .next() peels off the header. .enumerate() adds an index so an error can name the real row number (i + 2, because the header is row 1 and enumerate starts at 0).
ok_or_else is the bridge from Option to Result: it turns “there was no header line” (None) into our typed Err(AppError::Parse(...)), which the ? then propagates.
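The Option-to-Result bridge is easier to see in isolation. The sketch below pulls just the header step into a helper; header_cells is a hypothetical name used only here, and the PartialEq derive is added so the results can be compared.

```rust
#[derive(Debug, PartialEq)]
enum AppError {
    Parse(String),
}

/// Hypothetical helper: returns the trimmed header cells, or a typed error
/// when the input has no first line at all.
fn header_cells(input: &str) -> Result<Vec<String>, AppError> {
    let header = input
        .lines()
        .next() // Option<&str>: None for empty input
        .ok_or_else(|| AppError::Parse("empty input".into()))?; // Option -> Result, then propagate
    Ok(header.split(',').map(|s| s.trim().to_string()).collect())
}

fn main() {
    println!("{:?}", header_cells("name, age ,salary\nAda,36,1.0"));
    println!("{:?}", header_cells(""));
}
```

The closure passed to ok_or_else only runs in the None case, so the error String is never allocated on the success path.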
Inferring and summarising
fn classify(values: &[&str]) -> ColumnType {
    if values.iter().all(|v| v.parse::<i64>().is_ok()) {
        return ColumnType::Int;
    }
    if values.iter().all(|v| v.parse::<f64>().is_ok()) {
        return ColumnType::Float;
    }
    ColumnType::Text
}

fn summarise(values: &[&str], kind: &ColumnType) -> ColumnStats {
    let mut stats = ColumnStats::default();
    stats.count = values.len();
    if matches!(kind, ColumnType::Text) {
        return stats;
    }
    for v in values {
        let n: f64 = v.parse().unwrap(); // safe: classify already proved this parses
        stats.sum += n;
        stats.min = Some(stats.min.map_or(n, |m| m.min(n)));
        stats.max = Some(stats.max.map_or(n, |m| m.max(n)));
    }
    stats
}
summarise shows a tidy Option pattern: stats.min.map_or(n, |m| m.min(n)) reads as “if we haven’t seen a value yet, the new minimum is n; otherwise keep the smaller of the existing min and n.” No null checks, no sentinels — just the Option API doing the bookkeeping.
The .unwrap() here is the rare justified one: classify has already proven every value parses, but the type system can’t carry that proof across the function boundary, so we assert it with a comment. That comment is the discipline — an unexplained .unwrap() is a future panic waiting to happen.
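Stripped of everything else, the map_or bookkeeping looks like this. min_max is a hypothetical helper written for this sketch; it takes already-parsed numbers so the Option pattern is the only thing on show.

```rust
/// Fold a slice into (min, max) with the same Option pattern as summarise.
fn min_max(values: &[f64]) -> (Option<f64>, Option<f64>) {
    let mut min: Option<f64> = None;
    let mut max: Option<f64> = None;
    for &n in values {
        // None  -> n becomes the first minimum / maximum
        // Some  -> keep the smaller / larger of the old value and n
        min = Some(min.map_or(n, |m| m.min(n)));
        max = Some(max.map_or(n, |m| m.max(n)));
    }
    (min, max)
}

fn main() {
    println!("{:?}", min_max(&[36.0, 57.0, 42.0, 21.0])); // (Some(21.0), Some(57.0))
    println!("{:?}", min_max(&[])); // (None, None)
}
```

An empty slice stays (None, None), which is exactly the "nothing yet" case a sentinel value would have to fake.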
Wiring it together — and running it
Now the whole thing. build_reports is the iterator pipeline that ties columns to their stats; print_table formats; run is the readable success-path; main is tiny — it just reports an error to stderr and sets the exit code.
This is the headline. Hit Run — it compiles and executes on the real Rust compiler and prints the exact table promised above. Then edit DATA: add a row, change a salary to text, watch the column flip to text.
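What follows is one plausible assembly of the whole program, not necessarily identical to the lesson's own file. The types, parse, classify and summarise are condensed from the sections above; build_reports, print_table, the Display impl for AppError, the column widths and the DATA constant are assumptions filled in from the descriptions in this lesson.

```rust
use std::fmt;

#[derive(Debug)]
enum ColumnType { Int, Float, Text }

impl fmt::Display for ColumnType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let label = match self { Self::Int => "int", Self::Float => "float", Self::Text => "text" };
        write!(f, "{label}")
    }
}

#[derive(Debug, Default)]
struct ColumnStats { count: usize, min: Option<f64>, max: Option<f64>, sum: f64 }
#[derive(Debug)]
struct ColumnReport { name: String, kind: ColumnType, stats: ColumnStats }
#[derive(Debug)]
enum AppError { Parse(String) }

impl fmt::Display for AppError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let AppError::Parse(msg) = self;
        write!(f, "{msg}")
    }
}

const DATA: &str = "name,age,salary
Ada,36,125000.00
Grace,57,98500.50
Linus,42,48000.00
Dennis,21,42000.00";

fn parse(input: &str) -> Result<(Vec<String>, Vec<Vec<String>>), AppError> {
    let split = |l: &str| -> Vec<String> { l.split(',').map(|s| s.trim().to_string()).collect() };
    let mut lines = input.lines();
    let columns = split(lines.next().ok_or_else(|| AppError::Parse("empty input".into()))?);
    let mut rows = Vec::new();
    for (i, line) in lines.enumerate() {
        let cells = split(line);
        if cells.len() != columns.len() {
            let (got, want) = (cells.len(), columns.len());
            return Err(AppError::Parse(format!("row {} has {got} cells, expected {want}", i + 2)));
        }
        rows.push(cells);
    }
    Ok((columns, rows))
}

fn classify(values: &[&str]) -> ColumnType {
    if values.iter().all(|v| v.parse::<i64>().is_ok()) { return ColumnType::Int; }
    if values.iter().all(|v| v.parse::<f64>().is_ok()) { return ColumnType::Float; }
    ColumnType::Text
}

fn summarise(values: &[&str], kind: &ColumnType) -> ColumnStats {
    let mut stats = ColumnStats { count: values.len(), ..Default::default() };
    if matches!(kind, ColumnType::Text) { return stats; }
    for v in values {
        let n: f64 = v.parse().unwrap(); // safe: classify already proved this parses
        stats.sum += n;
        stats.min = Some(stats.min.map_or(n, |m| m.min(n)));
        stats.max = Some(stats.max.map_or(n, |m| m.max(n)));
    }
    stats
}

fn build_reports(columns: Vec<String>, rows: Vec<Vec<String>>) -> Vec<ColumnReport> {
    columns.iter().enumerate().map(|(i, name)| {
        let values: Vec<&str> = rows.iter().map(|row| row[i].as_str()).collect();
        let kind = classify(&values);
        let stats = summarise(&values, &kind);
        ColumnReport { name: name.clone(), kind, stats }
    }).collect()
}

fn print_table(reports: &[ColumnReport]) {
    let cell = |v: Option<f64>| v.map_or("-".to_string(), |x| format!("{x:.2}"));
    println!("{:<8} {:<6} {:>5} {:>10} {:>10} {:>10}", "column", "type", "count", "min", "max", "mean");
    for r in reports {
        let mean = r.stats.min.map(|_| r.stats.sum / r.stats.count as f64);
        // to_string() first: a Display impl built on write! ignores width flags.
        println!("{:<8} {:<6} {:>5} {:>10} {:>10} {:>10}", r.name, r.kind.to_string(),
            r.stats.count, cell(r.stats.min), cell(r.stats.max), cell(mean));
    }
}

fn run() -> Result<(), AppError> {
    let (columns, rows) = parse(DATA)?;
    print_table(&build_reports(columns, rows));
    Ok(())
}

fn main() {
    if let Err(e) = run() {
        eprintln!("error: {e}");
        std::process::exit(1);
    }
}
```

One design choice worth noting in this sketch: the mean is derived from min being Some, so a text column prints "-" in all three numeric cells without a separate flag.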
build_reports is the curriculum in a single expression: .iter().enumerate().map(...).collect() walks the column names, and inside the closure it borrows the rows to gather one column’s values, classifies them, and summarises them into a ColumnReport. main does almost nothing — it calls run, and on an error prints it and exits non-zero. That tiny main with most logic behind Result-returning functions is the idiomatic Rust shape.
The error path — make it fail on purpose
The program treats a ragged row as a typed error, not a panic. Replace DATA with a table whose second row is missing a cell and the parse function returns Err(AppError::Parse(...)), which bubbles up through ?, gets printed to stderr by main, and exits with code 1.
Run this and read the message — it names the offending row:
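A minimal sketch of that failure, using the parse function from earlier and a made-up RAGGED table whose first data row has lost its salary cell. To keep the sketch friendly to run anywhere, main only prints the message; the full program's main would also call std::process::exit(1).

```rust
#[derive(Debug, PartialEq)]
enum AppError {
    Parse(String),
}

fn parse(input: &str) -> Result<(Vec<String>, Vec<Vec<String>>), AppError> {
    let mut lines = input.lines();
    let header = lines.next().ok_or_else(|| AppError::Parse("empty input".into()))?;
    let columns: Vec<String> = header.split(',').map(|s| s.trim().to_string()).collect();
    let mut rows: Vec<Vec<String>> = Vec::new();
    for (i, line) in lines.enumerate() {
        let cells: Vec<String> = line.split(',').map(|s| s.trim().to_string()).collect();
        if cells.len() != columns.len() {
            return Err(AppError::Parse(format!(
                "row {} has {} cells, expected {}",
                i + 2, cells.len(), columns.len()
            )));
        }
        rows.push(cells);
    }
    Ok((columns, rows))
}

// Illustrative input: row 2 (the first data row) is one cell short.
const RAGGED: &str = "name,age,salary\nAda,36\nGrace,57,98500.50";

fn main() {
    match parse(RAGGED) {
        Ok((columns, rows)) => println!("{} columns, {} rows", columns.len(), rows.len()),
        // Goes to stderr: "error: row 2 has 2 cells, expected 3"
        Err(AppError::Parse(msg)) => eprintln!("error: {msg}"),
    }
}
```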
No exception unwinds, no null slips through. The failure is an ordinary value of type Result, and the only way to read the rows is to handle the Err first. That’s lesson 10’s whole promise, paying off in a real program.
Where Rust draws the line — the move error returns
One last reminder that the borrow checker never sleeps, not even in your capstone. build_reports takes columns: Vec<String> by value — it owns the vector. So once you pass columns in, you can’t use it again. This is exactly the move from lesson 4, now in production context.
Hit Run and read the error; it’s the same E0382 you met at the start of the ownership lessons:
The fix is the lesson-5 reflex: either borrow (build_reports(&columns) and take &[String]) or clone if you genuinely need two owners. The compiler won’t let you ship the bug. Twelve lessons later, that error is no longer scary — it’s a familiar nudge toward the right design.
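Both fixes can be sketched with two tiny stand-in functions. count_owned and count_borrowed are hypothetical names that only mimic the by-value and by-reference signatures; they are not part of the summariser.

```rust
/// By-value signature: takes ownership, so the caller's vector is moved.
fn count_owned(columns: Vec<String>) -> usize {
    columns.len()
}

/// Borrowing signature: reads through a slice, the caller keeps the vector.
fn count_borrowed(columns: &[String]) -> usize {
    columns.len()
}

fn main() {
    let columns = vec!["name".to_string(), "age".to_string()];

    // Fix 1: borrow. `columns` is still usable afterwards.
    let n = count_borrowed(&columns);
    println!("{n} columns, first is {}", columns[0]);

    // Fix 2: clone, when two owners are genuinely needed.
    let m = count_owned(columns.clone());
    println!("{m} columns, still have {:?}", columns);

    // Moving is fine as the last use...
    let k = count_owned(columns);
    println!("{k} columns");
    // ...but uncommenting the next line brings back E0382 (borrow of moved value):
    // println!("{:?}", columns);
}
```

Borrowing is usually the better default here: the clone copies every column name, while the slice costs nothing.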
What you just used
Walk back through the curriculum and find each piece in the program above:
- Variables, types, mutation (lesson 2): let, let mut stats, the annotation let n: f64.
- Control flow (3): match as an expression, if let Err(e), exhaustive enum matching.
- Ownership & references (4, 5): parse(&input) and summarise(&values, ...) borrow; build_reports(columns, rows) consumes.
- Slices (6): &[&str], &[ColumnReport], .as_str().
- Structs & enums (7): three structs, two enums, derive(Default), a hand-written Display.
- Traits & generics (8): implementing Display; Option::map_or and Result::is_ok, methods on the generic Option and Result types.
- Closures & iterators (9): .map, .collect, .enumerate, .all, .iter.
- Errors (10): Result<_, AppError>, the ? operator, ok_or_else.
The point isn’t that this is the world’s best CSV summariser. It’s that every step felt unsurprising — you reached for the same pieces in the same shapes, and the compiler caught the typos and type mismatches before the program ever ran.
How you'd productionize this for real, large files
The embedded-string version is honest about its logic but not its scale. To turn it into a tool that survives a multi-gigabyte CSV:
Stream instead of slurp. parse reads the whole input into Vec<Vec<String>> — fine for a sample, fatal for a 50 GB file. The fix is to read line-by-line with a BufReader over the file (io::BufReader::new(File::open(path)?).lines()) and update each column’s running ColumnStats as you go, never holding more than one row in memory. Min / max / sum / count are all single-pass accumulators, so this works without buffering.
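A rough sketch of the single-pass idea, under simplifying assumptions: RunningStats and stream_stats are hypothetical names, type inference is left out, and non-numeric cells are simply skipped. An in-memory byte slice stands in for the opened file so the example runs without a filesystem.

```rust
use std::io::{self, BufRead, BufReader};

#[derive(Debug, Default, Clone, PartialEq)]
struct RunningStats {
    count: usize,
    min: Option<f64>,
    max: Option<f64>,
    sum: f64,
}

impl RunningStats {
    /// Fold one value into the accumulator; no history is kept.
    fn update(&mut self, n: f64) {
        self.count += 1;
        self.sum += n;
        self.min = Some(self.min.map_or(n, |m| m.min(n)));
        self.max = Some(self.max.map_or(n, |m| m.max(n)));
    }
}

/// Streams rows from any buffered reader, holding one line at a time.
fn stream_stats<R: BufRead>(reader: R) -> io::Result<Vec<RunningStats>> {
    let mut lines = reader.lines();
    let header = match lines.next() {
        Some(line) => line?,
        None => return Ok(Vec::new()), // empty input: no columns
    };
    let mut stats = vec![RunningStats::default(); header.split(',').count()];
    for line in lines {
        let line = line?; // I/O errors propagate instead of panicking
        for (col, cell) in stats.iter_mut().zip(line.split(',')) {
            if let Ok(n) = cell.trim().parse::<f64>() {
                col.update(n);
            }
        }
    }
    Ok(stats)
}

fn main() -> io::Result<()> {
    // Stand-in for BufReader::new(File::open(path)?).
    let data = "age,salary\n36,125000.00\n57,98500.50\n";
    for s in stream_stats(BufReader::new(data.as_bytes()))? {
        println!("{s:?}");
    }
    Ok(())
}
```

Memory use depends on the number of columns and the longest line, not on the number of rows.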
Use csv + serde. Our split(',') breaks the instant a field contains a comma inside quotes ("Smith, Jr.") or an embedded newline. The csv crate handles RFC 4180 quoting correctly and can deserialize each row straight into a struct via serde.
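The breakage is easy to reproduce with the standard library alone. In this sketch naive_cells is a hypothetical helper that does the same split-and-trim as parse; the csv crate itself is not used here.

```rust
/// Same splitting rule as the lesson's parse: cut on every comma, trim each cell.
fn naive_cells(line: &str) -> Vec<&str> {
    line.split(',').map(str::trim).collect()
}

fn main() {
    let plain = "Ada,36,125000.00";
    let quoted = "\"Smith, Jr.\",42,48000.00";

    // Three cells, as expected.
    println!("{:?}", naive_cells(plain));
    // Four cells: the quoted name was cut in two at its inner comma.
    println!("{:?}", naive_cells(quoted));
}
```

Fed to the summariser, the second line would be rejected as a ragged row even though it is valid CSV.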
Handle malformed rows by policy, not panic. Real data is dirty. Decide up front: skip the bad row and count it, or abort with a typed error like we do. A --strict flag (parsed by clap) toggling between the two is a friendly touch.
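One way the two policies might look in code, as a sketch only. Policy, Loaded and load are hypothetical names, the input is assumed to be data rows without a header, and the flag parsing that clap would do is left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Policy {
    Strict,  // abort on the first malformed row
    Lenient, // skip malformed rows and count them
}

#[derive(Debug, PartialEq)]
struct Loaded {
    rows: Vec<Vec<String>>,
    skipped: usize,
}

/// Loads data rows (no header), applying the chosen policy to ragged rows.
fn load(input: &str, expected: usize, policy: Policy) -> Result<Loaded, String> {
    let mut out = Loaded { rows: Vec::new(), skipped: 0 };
    for (i, line) in input.lines().enumerate() {
        let cells: Vec<String> = line.split(',').map(|s| s.trim().to_string()).collect();
        if cells.len() == expected {
            out.rows.push(cells);
        } else if policy == Policy::Strict {
            return Err(format!("row {} has {} cells, expected {}", i + 1, cells.len(), expected));
        } else {
            out.skipped += 1;
        }
    }
    Ok(out)
}

fn main() {
    let data = "Ada,36\nGrace\nLinus,42";
    println!("{:?}", load(data, 2, Policy::Strict));
    println!("{:?}", load(data, 2, Policy::Lenient));
}
```

Returning the skipped count alongside the rows keeps the lenient mode honest: the caller can still report how much data was dropped.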
Parallelise the columns. Lesson 11’s tools apply directly: split the file into chunks, give each chunk to a thread, have each thread build partial ColumnStats, then merge the partials. min, max and count are associative, and sum is associative up to floating-point rounding, so two partial summaries can be combined into one; that is exactly what makes the parallel version correct. Wrap the shared accumulator in Arc<Mutex<...>> and you’ve used the last lesson in the last program.
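The merge step is the part worth sketching. Below, Stats, pick and parallel_stats are hypothetical names, the input is an already-parsed column of numbers rather than file chunks, and scoped threads from the standard library are assumed to be available (Rust 1.63 or later).

```rust
use std::sync::{Arc, Mutex};
use std::thread;

#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct Stats {
    count: usize,
    min: Option<f64>,
    max: Option<f64>,
    sum: f64,
}

/// Combine two optional values, keeping whichever side is present.
fn pick(a: Option<f64>, b: Option<f64>, f: fn(f64, f64) -> f64) -> Option<f64> {
    match (a, b) {
        (Some(x), Some(y)) => Some(f(x, y)),
        (x, None) => x,
        (None, y) => y,
    }
}

impl Stats {
    /// Merging two partial summaries gives the summary of their union.
    fn merge(self, other: Stats) -> Stats {
        Stats {
            count: self.count + other.count,
            min: pick(self.min, other.min, f64::min),
            max: pick(self.max, other.max, f64::max),
            sum: self.sum + other.sum,
        }
    }

    fn from_slice(values: &[f64]) -> Stats {
        values.iter().fold(Stats::default(), |acc, &n| {
            acc.merge(Stats { count: 1, min: Some(n), max: Some(n), sum: n })
        })
    }
}

fn parallel_stats(values: &[f64], chunk_size: usize) -> Stats {
    let total = Arc::new(Mutex::new(Stats::default()));
    thread::scope(|scope| {
        for chunk in values.chunks(chunk_size.max(1)) {
            let total = Arc::clone(&total);
            scope.spawn(move || {
                let partial = Stats::from_slice(chunk); // computed without holding the lock
                let mut guard = total.lock().unwrap();
                let merged = guard.merge(partial);
                *guard = merged;
            });
        }
    });
    let result = *total.lock().unwrap();
    result
}

fn main() {
    let salaries = [125000.00, 98500.50, 48000.00, 42000.00];
    println!("{:?}", parallel_stats(&salaries, 2));
}
```

Because threads finish in no fixed order, the sum can differ in its last bits between runs on inputs that are not exactly representable; count, min and max are unaffected.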
To read from a real file
On your own machine, with a filesystem, the only change is the source of the bytes. Swap the const DATA and parse(DATA)? lines for:
fn main() {
    if let Err(e) = run() {
        eprintln!("error: {e}");
        std::process::exit(1);
    }
}

fn run() -> Result<(), AppError> {
    let path = std::env::args().nth(1);
    let input = match path {
        Some(p) => std::fs::read_to_string(p).map_err(|e| AppError::Parse(e.to_string()))?,
        None => return Err(AppError::Parse("usage: summarise <file.csv>".into())),
    };
    let (columns, rows) = parse(&input)?;
    print_table(&build_reports(columns, rows));
    Ok(())
}
Everything downstream of parse — the classification, the stats, the table — is untouched. That separation between “where the data came from” and “what we do with it” is itself a design lesson: keep the source at the edges and the logic pure in the middle.
Key takeaways
- A capstone is composition: ownership, borrowing, structs, enums, traits, iterators, and Result are not separate features but one vocabulary.
- A tiny main plus Result-returning functions is the idiomatic shape; ? makes the success path read like ordinary code.
- Inference, stats, and formatting each became one small, borrowing function; the iterator pipeline in build_reports stitched them together.
- Running your code catches what reading it can’t: the Display-padding bug was invisible until the real compiler ran it.
- When std runs out, the ecosystem (csv, serde, clap, anyhow) extends the same patterns; the shapes you learned carry straight over.
Look back at the program one more time. The ownership rules from lesson 4 decided who could touch the data. Borrowing from lesson 5 let the functions read without taking. Traits gave it a face to print; iterators gave it a pipeline; Result and ? gave it a way to fail honestly. Even concurrency was waiting in the wings.
It was all there, in one small file you can now read and write yourself. That was the whole point. The compiler will keep arguing with you — and every time, it’ll be teaching you something. Go build something.