Capstone · #12 of 13

Capstone: A CSV Summariser

Pulling everything together into one small, real program

Twelve lessons ago you printed hello, world. You now know enough to build a real, complete program — one that reads data, decides what it’s looking at, computes statistics, handles its own errors, and prints a clean report.

This lesson builds exactly that. Every concept you’ve met shows up in one small file.

A capstone isn’t new material. It’s the moment the separate pieces — ownership, references, structs, enums, traits, iterators, Result, the ? operator — stop being individual lessons and become a way of writing programs. We’ll build a CSV summariser: give it a table of comma-separated values, and it tells you, per column, whether that column holds integers, floating-point numbers, or text, plus the count / min / max / mean for the numeric ones.

It’s small enough to fit on this page and big enough to touch nearly every idea from the curriculum. Read it once for the shape, then run it, then change it.

What we’re building

Feed the program this table:

name,age,salary
Ada,36,125000.00
Grace,57,98500.50
Linus,42,48000.00
Dennis,21,42000.00

and it prints:

column    type    count   min         max         mean
name      text    4       -           -           -
age       int     4       21.00       57.00       39.00
salary    float   4       42000.00    125000.00   78375.12

The logic, in four steps:

  1. Read the CSV text.
  2. For each column, look at every value and decide: integer, floating-point, or text.
  3. For numeric columns, compute count / min / max / mean.
  4. Print a fixed-width table. Any failure (empty input, a ragged row) becomes a typed error, a one-line message on stderr, and a non-zero exit code.

One column, three guesses

Before the whole program, here’s its beating heart in isolation: a classify function that walks a column’s values and guesses its type. The rule is simple and greedy — if every value parses as an integer, it’s an int; else if every value parses as a float, it’s a float; otherwise it’s text.

Hit Run and watch three different columns get three different verdicts:

Type inference, in miniature
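
A minimal sketch of that demo, assuming the three columns from the sample table as input. The PartialEq derive is an addition so that results can be compared directly; everything else follows the rule described above.

```rust
// PartialEq is an assumption added for easy comparison in checks.
#[derive(Debug, PartialEq)]
enum ColumnType {
    Int,
    Float,
    Text,
}

// Greedy inference: try the strictest type first, fall back to text.
fn classify(values: &[&str]) -> ColumnType {
    if values.iter().all(|v| v.parse::<i64>().is_ok()) {
        return ColumnType::Int;
    }
    if values.iter().all(|v| v.parse::<f64>().is_ok()) {
        return ColumnType::Float;
    }
    ColumnType::Text
}

fn main() {
    let ages = ["36", "57", "42", "21"];
    let salaries = ["125000.00", "98500.50", "48000.00", "42000.00"];
    let names = ["Ada", "Grace", "Linus", "Dennis"];

    println!("ages:     {:?}", classify(&ages));
    println!("salaries: {:?}", classify(&salaries));
    println!("names:    {:?}", classify(&names));
}
```

One mixed value is enough to demote a column: put "n/a" in the ages and the verdict becomes Text.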

Notice how much of the curriculum is already here. &[&str] is a slice (lesson 6) — the function borrows the values, it doesn’t own them. .iter() and .all() are iterator methods (lesson 9); .all() short-circuits the moment one value fails to parse. v.parse::<i64>() returns a Result (lesson 10), and .is_ok() collapses it to a plain bool. Four lessons, four lines.

The types

A handful of types do most of the work in the full program. Read them slowly — every line reaches back to an earlier lesson.

use std::fmt;

#[derive(Debug)]
enum ColumnType {
    Int,
    Float,
    Text,
}

impl fmt::Display for ColumnType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ColumnType::Int => write!(f, "int"),
            ColumnType::Float => write!(f, "float"),
            ColumnType::Text => write!(f, "text"),
        }
    }
}

#[derive(Debug, Default)]
struct ColumnStats {
    count: usize,
    min: Option<f64>,
    max: Option<f64>,
    sum: f64,
}

#[derive(Debug)]
struct ColumnReport {
    name: String,
    kind: ColumnType,
    stats: ColumnStats,
}

#[derive(Debug)]
enum AppError {
    Parse(String),
}

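Things worth noticing before we run anything: ColumnType implements Display by hand, which is what lets the table print int, float, or text. ColumnStats derives Default, so ColumnStats::default() starts with a zero count, a zero sum, and no min or max. min and max are Option<f64> because a text column, or a column with no rows, has neither. AppError is an enum with a single Parse variant that carries its message as a String.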
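
The listing derives only Debug for AppError, but the main shown later prints the error with {e}, which needs Display. Here is a sketch of one way to supply it; the "parse error: " prefix and the std::error::Error impl are assumptions, not something the listing shows.

```rust
use std::fmt;

#[derive(Debug)]
enum AppError {
    Parse(String),
}

// Assumed implementation: the message wording is illustrative.
impl fmt::Display for AppError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AppError::Parse(msg) => write!(f, "parse error: {msg}"),
        }
    }
}

// Optional: lets AppError be used wherever a std error is expected.
impl std::error::Error for AppError {}

fn main() {
    let e = AppError::Parse("empty input".into());
    eprintln!("error: {e}");
}
```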

Parsing the table

fn parse(input: &str) -> Result<(Vec<String>, Vec<Vec<String>>), AppError> {
    let mut lines = input.lines();
    let header = lines.next().ok_or_else(|| AppError::Parse("empty input".into()))?;
    let columns: Vec<String> = header.split(',').map(|s| s.trim().to_string()).collect();

    let mut rows: Vec<Vec<String>> = Vec::new();
    for (i, line) in lines.enumerate() {
        let cells: Vec<String> = line.split(',').map(|s| s.trim().to_string()).collect();
        if cells.len() != columns.len() {
            return Err(AppError::Parse(format!(
                "row {} has {} cells, expected {}",
                i + 2, cells.len(), columns.len()
            )));
        }
        rows.push(cells);
    }
    Ok((columns, rows))
}

We borrow input: &str rather than take ownership — the caller keeps its string. .lines() is a lazy iterator; .next() peels off the header. .enumerate() adds an index so an error can name the real row number (i + 2, because the header is row 1 and enumerate starts at 0).

ok_or_else is the bridge from Option to Result: it turns “there was no header line” (None) into our typed Err(AppError::Parse(...)), which the ? then propagates.
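
The same bridge in isolation, as a small sketch. The header function is a hypothetical helper that does only the first step of parse, and PartialEq is derived here just so results can be compared.

```rust
#[derive(Debug, PartialEq)]
enum AppError {
    Parse(String),
}

// Returns the first line of the input, or a typed error if there is none.
fn header(input: &str) -> Result<&str, AppError> {
    let first = input
        .lines()
        .next() // Option<&str>: None when the input has no lines
        .ok_or_else(|| AppError::Parse("empty input".into()))?; // Option -> Result, then ? returns early on Err
    Ok(first)
}

fn main() {
    println!("{:?}", header("name,age\nAda,36"));
    println!("{:?}", header(""));
}
```

The closure passed to ok_or_else only runs on the None path, so the error String is never allocated when a header exists.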

Inferring and summarising

fn classify(values: &[&str]) -> ColumnType {
    if values.iter().all(|v| v.parse::<i64>().is_ok()) {
        return ColumnType::Int;
    }
    if values.iter().all(|v| v.parse::<f64>().is_ok()) {
        return ColumnType::Float;
    }
    ColumnType::Text
}

fn summarise(values: &[&str], kind: &ColumnType) -> ColumnStats {
    let mut stats = ColumnStats::default();
    stats.count = values.len();

    if matches!(kind, ColumnType::Text) {
        return stats;
    }

    for v in values {
        let n: f64 = v.parse().unwrap(); // safe: classify already proved this parses
        stats.sum += n;
        stats.min = Some(stats.min.map_or(n, |m| m.min(n)));
        stats.max = Some(stats.max.map_or(n, |m| m.max(n)));
    }
    stats
}

summarise shows a tidy Option pattern: stats.min.map_or(n, |m| m.min(n)) reads as “if we haven’t seen a value yet, the new minimum is n; otherwise keep the smaller of the existing min and n.” No null checks, no sentinels — just the Option API doing the bookkeeping.
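
Here is that pattern on its own, as a minimal sketch. running_min is a hypothetical helper that keeps only the minimum, so the map_or step is easy to trace.

```rust
// Folds a slice into its minimum, using None to mean "nothing seen yet".
fn running_min(values: &[f64]) -> Option<f64> {
    let mut min: Option<f64> = None;
    for &n in values {
        // First value: take n. Later values: keep the smaller of m and n.
        min = Some(min.map_or(n, |m| m.min(n)));
    }
    min
}

fn main() {
    println!("{:?}", running_min(&[36.0, 57.0, 42.0, 21.0]));
    println!("{:?}", running_min(&[]));
}
```

An empty slice falls out naturally as None, which is why the table can print a dash instead of a made-up number.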

The .unwrap() here is the rare justified one: classify has already proven every value parses, but the type system can’t carry that proof across the function boundary, so we assert it with .unwrap() and record the reason in a comment. The guarantee only holds when summarise is called with the kind that classify returned for these same values. That comment is the discipline: an unexplained .unwrap() is a future panic waiting to happen.
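
If you would rather let the types carry the proof, one possible alternative is to have classification hand back the parsed numbers. This is a sketch of a different design, not the lesson's program: the Column enum and classify_and_parse are hypothetical names.

```rust
// Hypothetical alternative: each numeric variant carries its parsed values.
#[derive(Debug, PartialEq)]
enum Column {
    Int(Vec<i64>),
    Float(Vec<f64>),
    Text,
}

fn classify_and_parse(values: &[&str]) -> Column {
    // Collecting into a Result stops at the first value that fails to parse.
    if let Ok(ints) = values.iter().map(|v| v.parse::<i64>()).collect::<Result<Vec<_>, _>>() {
        return Column::Int(ints);
    }
    if let Ok(floats) = values.iter().map(|v| v.parse::<f64>()).collect::<Result<Vec<_>, _>>() {
        return Column::Float(floats);
    }
    Column::Text
}

fn main() {
    println!("{:?}", classify_and_parse(&["36", "57"]));
    println!("{:?}", classify_and_parse(&["1.5", "2"]));
    println!("{:?}", classify_and_parse(&["Ada", "36"]));
}
```

The trade-off is memory: this version keeps a parsed copy of every numeric column, while the lesson's version parses twice and stores nothing extra.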

Wiring it together — and running it

Now the whole thing. build_reports is the iterator pipeline that ties columns to their stats; print_table formats; run is the readable success-path; main is tiny — it just reports an error to stderr and sets the exit code.

This is the headline. Hit Run — it compiles and executes on the real Rust compiler and prints the exact table promised above. Then edit DATA: add a row, change a salary to text, watch the column flip to text.

The whole program: runs end-to-end
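
A condensed sketch of the wiring, under some assumptions. The bodies of build_reports and print_table are guesses that match the description in this lesson; cell, owned, and sample are hypothetical helpers; and the parsed table is supplied directly so that parse and the error path can be left out for length.

```rust
use std::fmt;

#[derive(Debug)]
enum ColumnType { Int, Float, Text }

impl fmt::Display for ColumnType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ColumnType::Int => write!(f, "int"),
            ColumnType::Float => write!(f, "float"),
            ColumnType::Text => write!(f, "text"),
        }
    }
}

#[derive(Debug, Default)]
struct ColumnStats { count: usize, min: Option<f64>, max: Option<f64>, sum: f64 }

#[derive(Debug)]
struct ColumnReport { name: String, kind: ColumnType, stats: ColumnStats }

fn classify(values: &[&str]) -> ColumnType {
    if values.iter().all(|v| v.parse::<i64>().is_ok()) {
        return ColumnType::Int;
    }
    if values.iter().all(|v| v.parse::<f64>().is_ok()) {
        return ColumnType::Float;
    }
    ColumnType::Text
}

fn summarise(values: &[&str], kind: &ColumnType) -> ColumnStats {
    let mut stats = ColumnStats { count: values.len(), ..Default::default() };
    if matches!(kind, ColumnType::Text) {
        return stats;
    }
    for v in values {
        let n: f64 = v.parse().unwrap(); // safe: classify already proved this parses
        stats.sum += n;
        stats.min = Some(stats.min.map_or(n, |m| m.min(n)));
        stats.max = Some(stats.max.map_or(n, |m| m.max(n)));
    }
    stats
}

// Assumed body: one report per column, borrowing cell i of every row.
fn build_reports(columns: Vec<String>, rows: Vec<Vec<String>>) -> Vec<ColumnReport> {
    columns
        .iter()
        .enumerate()
        .map(|(i, name)| {
            let values: Vec<&str> = rows.iter().map(|row| row[i].as_str()).collect();
            let kind = classify(&values);
            let stats = summarise(&values, &kind);
            ColumnReport { name: name.clone(), kind, stats }
        })
        .collect()
}

// Hypothetical helper: two decimals, or "-" when there is no number.
fn cell(x: Option<f64>) -> String {
    x.map_or("-".to_string(), |n| format!("{n:.2}"))
}

fn print_table(reports: &[ColumnReport]) {
    println!("{:<10}{:<8}{:<8}{:<12}{:<12}{}", "column", "type", "count", "min", "max", "mean");
    for r in reports {
        // A mean only exists once a numeric column has seen at least one value.
        let mean = r.stats.min.map(|_| r.stats.sum / r.stats.count as f64);
        // to_string() first: the write!-based Display impl ignores width flags.
        println!(
            "{:<10}{:<8}{:<8}{:<12}{:<12}{}",
            r.name, r.kind.to_string(), r.stats.count,
            cell(r.stats.min), cell(r.stats.max), cell(mean)
        );
    }
}

// Hypothetical stand-ins for what parse(DATA) would return.
fn owned(cells: &[&str]) -> Vec<String> {
    cells.iter().map(|s| s.to_string()).collect()
}

fn sample() -> (Vec<String>, Vec<Vec<String>>) {
    let rows = vec![
        owned(&["Ada", "36", "125000.00"]),
        owned(&["Grace", "57", "98500.50"]),
        owned(&["Linus", "42", "48000.00"]),
        owned(&["Dennis", "21", "42000.00"]),
    ];
    (owned(&["name", "age", "salary"]), rows)
}

fn main() {
    let (columns, rows) = sample();
    print_table(&build_reports(columns, rows));
}
```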

build_reports is the curriculum in a single expression: .iter().enumerate().map(...).collect() walks the column names, and inside the closure it borrows the rows to gather one column’s values, classifies them, and summarises them into a ColumnReport. main does almost nothing — it calls run, and on an error prints it and exits non-zero. That tiny main with most logic behind Result-returning functions is the idiomatic Rust shape.

The error path — make it fail on purpose

The program treats a ragged row as a typed error, not a panic. Replace DATA with a table whose second row is missing a cell and the parse function returns Err(AppError::Parse(...)), which bubbles up through ?, gets printed to stderr by main, and exits with code 1.

Run this and read the message — it names the offending row:

Errors are values, not crashes
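
A sketch of the failing case, using the parse function from earlier and a table whose second data row is short. The report helper is hypothetical: it returns the exit code instead of calling std::process::exit, so the demo can print what main would do.

```rust
#[derive(Debug, PartialEq)]
enum AppError {
    Parse(String),
}

// Illustrative input: the "Grace" row has two cells instead of three.
const DATA: &str = "name,age,salary\nAda,36,125000.00\nGrace,57\nLinus,42,48000.00";

fn parse(input: &str) -> Result<(Vec<String>, Vec<Vec<String>>), AppError> {
    let mut lines = input.lines();
    let header = lines.next().ok_or_else(|| AppError::Parse("empty input".into()))?;
    let columns: Vec<String> = header.split(',').map(|s| s.trim().to_string()).collect();

    let mut rows: Vec<Vec<String>> = Vec::new();
    for (i, line) in lines.enumerate() {
        let cells: Vec<String> = line.split(',').map(|s| s.trim().to_string()).collect();
        if cells.len() != columns.len() {
            return Err(AppError::Parse(format!(
                "row {} has {} cells, expected {}",
                i + 2, cells.len(), columns.len()
            )));
        }
        rows.push(cells);
    }
    Ok((columns, rows))
}

// Hypothetical helper: turns the outcome into the exit code main would use.
fn report(result: Result<(Vec<String>, Vec<Vec<String>>), AppError>) -> i32 {
    match result {
        Ok((columns, rows)) => {
            println!("{} columns, {} rows", columns.len(), rows.len());
            0
        }
        Err(AppError::Parse(msg)) => {
            eprintln!("error: {msg}");
            1
        }
    }
}

fn main() {
    let code = report(parse(DATA));
    // A real main would finish with std::process::exit(code).
    println!("exit code: {code}");
}
```

The short row is the second data row, so with the header counted as row 1 the message points at row 3.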

No exception unwinds, no null slips through. The failure is an ordinary value of type Result, and the only way to read the rows is to handle the Err first. That’s lesson 10’s whole promise, paying off in a real program.

Where Rust draws the line — the move error returns

One last reminder that the borrow checker never sleeps, not even in your capstone. build_reports takes columns: Vec<String> by value — it owns the vector. So once you pass columns in, you can’t use it again. This is exactly the move from lesson 4, now in production context.

Hit Run and read the error; it’s the same E0382 you met at the start of the ownership lessons:

Borrow of moved value: a compile error
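
A small sketch of the situation, with the offending line commented out so the rest still compiles. build_owned and build_borrowed are hypothetical stand-ins for the two possible signatures of build_reports.

```rust
// Owning version, like the lesson's build_reports: the caller gives up `columns`.
fn build_owned(columns: Vec<String>) -> usize {
    columns.len()
}

// Borrowing version: the caller keeps `columns`.
fn build_borrowed(columns: &[String]) -> usize {
    columns.len()
}

fn main() {
    let columns = vec!["name".to_string(), "age".to_string()];

    let n = build_borrowed(&columns); // borrow: `columns` is still ours afterwards
    println!("{n} columns, first is {}", columns[0]);

    let n = build_owned(columns); // move: ownership transfers here
    println!("{n} columns");

    // Uncommenting the next line brings back E0382 (borrow of moved value):
    // println!("{}", columns[0]);
}
```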

The fix is the lesson-5 reflex: either borrow (build_reports(&columns) and take &[String]) or clone if you genuinely need two owners. The compiler won’t let you ship the bug. Twelve lessons later, that error is no longer scary — it’s a familiar nudge toward the right design.

What you just used

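Walk back through the curriculum and find each piece in the program above: ownership and moves (lesson 4) in build_reports taking columns by value; borrowing (lesson 5) in parse(input: &str); slices (lesson 6) in classify(values: &[&str]); structs and enums in ColumnStats, ColumnReport, ColumnType, and AppError; traits in the Display impl; iterators (lesson 9) in .all(), .map(), .enumerate(), and .collect(); Result and ? (lesson 10) in parse and run.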

The point isn’t that this is the world’s best CSV summariser. It’s that every step felt unsurprising — you reached for the same pieces in the same shapes, and the compiler caught the typos and type mismatches before the program ever ran.

How you'd productionize this for real, large files

The embedded-string version is honest about its logic but not its scale. To turn it into a tool that survives a multi-gigabyte CSV:

Stream instead of slurp. parse reads the whole input into Vec<Vec<String>>, which is fine for a sample and fatal for a 50 GB file. The fix is to read line-by-line with a BufReader over the file (io::BufReader::new(File::open(path)?).lines()) and update each column’s running ColumnStats as you go, never holding more than one row in memory. Min / max / sum / count are all single-pass accumulators, so this works without buffering. Type inference can be single-pass too: keep a “still all integers” flag and a “still all floats” flag per column, and clear each one the first time a value fails to parse.
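
A minimal sketch of the streaming idea, assuming a hypothetical Running accumulator per column that tracks the type flags alongside the stats. It is generic over BufRead, so the demo can use an in-memory Cursor where a real tool would pass a BufReader over a File. Ragged-row checks are left out to keep it short.

```rust
use std::io::{self, BufRead, Cursor};

// Hypothetical per-column accumulator for a single streaming pass.
#[derive(Debug)]
struct Running {
    all_int: bool,
    all_float: bool,
    count: usize,
    sum: f64,
    min: Option<f64>,
    max: Option<f64>,
}

impl Running {
    fn new() -> Self {
        Running { all_int: true, all_float: true, count: 0, sum: 0.0, min: None, max: None }
    }

    fn update(&mut self, cell: &str) {
        self.count += 1;
        self.all_int = self.all_int && cell.parse::<i64>().is_ok();
        match cell.parse::<f64>() {
            // Only keep numeric stats while the column still looks numeric.
            Ok(n) if self.all_float => {
                self.sum += n;
                self.min = Some(self.min.map_or(n, |m| m.min(n)));
                self.max = Some(self.max.map_or(n, |m| m.max(n)));
            }
            Ok(_) => {}
            Err(_) => self.all_float = false,
        }
    }
}

// Reads a header and rows from any BufRead, one line at a time.
fn stream<R: BufRead>(reader: R) -> io::Result<Vec<(String, Running)>> {
    let mut lines = reader.lines();
    let header = match lines.next() {
        Some(line) => line?,
        None => return Ok(Vec::new()),
    };
    let mut cols: Vec<(String, Running)> =
        header.split(',').map(|s| (s.trim().to_string(), Running::new())).collect();
    for line in lines {
        let line = line?;
        // zip stops at the shorter side; a real tool would reject ragged rows here.
        for ((_, acc), cell) in cols.iter_mut().zip(line.split(',')) {
            acc.update(cell.trim());
        }
    }
    Ok(cols)
}

fn main() -> io::Result<()> {
    // Cursor stands in for BufReader::new(File::open(path)?).
    let data = Cursor::new("age,name\n36,Ada\n57,Grace\n");
    for (name, acc) in stream(data)? {
        println!("{name}: {acc:?}");
    }
    Ok(())
}
```

Once all_float has been cleared, the numeric fields of that column are stale and should be ignored when printing, just as the batch version skips stats for text columns.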

Use csv + serde. Our split(',') breaks the instant a field contains a comma inside quotes ("Smith, Jr.") or an embedded newline. The csv crate handles RFC 4180 quoting correctly and can deserialize each row straight into a struct via serde.
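
To see the breakage concretely, here is a tiny sketch using only the standard library. naive_fields is a hypothetical helper that splits the way our parse does.

```rust
// Naive field splitting, the same way the lesson's parse does it.
fn naive_fields(line: &str) -> Vec<&str> {
    line.split(',').map(|s| s.trim()).collect()
}

fn main() {
    // Two logical fields, but the quoted one contains a comma.
    let line = r#""Smith, Jr.",42"#;
    let fields = naive_fields(line);
    println!("{} fields: {:?}", fields.len(), fields);
}
```

The line holds two logical fields but comes back as three pieces, with the quote characters still attached, so our ragged-row check would reject a perfectly valid row.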

Handle malformed rows by policy, not panic. Real data is dirty. Decide up front: skip the bad row and count it, or abort with a typed error like we do. A --strict flag (parsed by clap) toggling between the two is a friendly touch.
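
One way the two policies could look in code, as a sketch. Policy and read_rows are hypothetical names, the error is a plain String to keep the example short, and the flag parsing itself is left out; row numbers here count data rows only.

```rust
// Hypothetical policy switch; a CLI flag such as --strict would choose the variant.
#[derive(Clone, Copy)]
enum Policy {
    Strict,
    Skip,
}

// Returns the well-formed rows plus a count of the rows that were skipped.
fn read_rows(input: &str, width: usize, policy: Policy) -> Result<(Vec<Vec<&str>>, usize), String> {
    let mut rows = Vec::new();
    let mut skipped = 0;
    for (i, line) in input.lines().enumerate() {
        let cells: Vec<&str> = line.split(',').map(str::trim).collect();
        if cells.len() == width {
            rows.push(cells);
            continue;
        }
        match policy {
            Policy::Strict => {
                return Err(format!("row {} has {} cells, expected {}", i + 1, cells.len(), width))
            }
            Policy::Skip => skipped += 1,
        }
    }
    Ok((rows, skipped))
}

fn main() {
    let body = "Ada,36\nGrace\nLinus,42";
    println!("{:?}", read_rows(body, 2, Policy::Skip));
    println!("{:?}", read_rows(body, 2, Policy::Strict));
}
```

Returning the skipped count alongside the rows keeps the lenient mode honest: the caller can still report how much data was dropped.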

Parallelise across chunks. Lesson 11’s tools apply directly: split the file into chunks, give each chunk to a thread, have each thread build partial ColumnStats, then merge the partials. min / max / count merge associatively (you can combine partial summaries in any grouping), and sum does too up to floating-point rounding, which is what makes the parallel version correct. Collect the partials from the join handles, or wrap a shared accumulator in Arc<Mutex<...>>, and you’ve used the last lesson in the last program.
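
A sketch of the merge idea for a single numeric column, assuming the values are already parsed. Partial and parallel_summary are hypothetical names, and this version collects results from scoped-thread join handles (std::thread::scope) rather than sharing an Arc<Mutex<...>>.

```rust
use std::thread;

// Hypothetical partial summary for one chunk of one numeric column.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct Partial {
    count: usize,
    sum: f64,
    min: Option<f64>,
    max: Option<f64>,
}

impl Partial {
    fn from_chunk(chunk: &[f64]) -> Partial {
        let mut p = Partial::default();
        for &n in chunk {
            p.count += 1;
            p.sum += n;
            p.min = Some(p.min.map_or(n, |m| m.min(n)));
            p.max = Some(p.max.map_or(n, |m| m.max(n)));
        }
        p
    }

    // Combines two partial summaries into one.
    fn merge(self, other: Partial) -> Partial {
        Partial {
            count: self.count + other.count,
            sum: self.sum + other.sum,
            min: match (self.min, other.min) {
                (Some(a), Some(b)) => Some(a.min(b)),
                (a, b) => a.or(b),
            },
            max: match (self.max, other.max) {
                (Some(a), Some(b)) => Some(a.max(b)),
                (a, b) => a.or(b),
            },
        }
    }
}

// chunk_size must be non-zero: slice::chunks panics on 0.
fn parallel_summary(values: &[f64], chunk_size: usize) -> Partial {
    // Scoped threads may borrow `values`; all are joined before scope returns.
    thread::scope(|s| {
        let handles: Vec<_> = values
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || Partial::from_chunk(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .fold(Partial::default(), Partial::merge)
    })
}

fn main() {
    let salaries = [125000.00, 98500.50, 48000.00, 42000.00];
    println!("{:?}", parallel_summary(&salaries, 2));
}
```

Because each thread owns its own Partial until the join, no lock is needed here. A mutex-guarded accumulator reaches the same result but makes the threads take turns at the merge step.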

To read from a real file

On your own machine, with a filesystem, the only change is the source of the bytes. Swap the const DATA and parse(DATA)? lines for:

fn main() {
    if let Err(e) = run() {
        eprintln!("error: {e}");
        std::process::exit(1);
    }
}

fn run() -> Result<(), AppError> {
    let path = std::env::args().nth(1);
    let input = match path {
        Some(p) => std::fs::read_to_string(p).map_err(|e| AppError::Parse(e.to_string()))?,
        None => return Err(AppError::Parse("usage: summarise <file.csv>".into())),
    };
    let (columns, rows) = parse(&input)?;
    print_table(&build_reports(columns, rows));
    Ok(())
}

Everything downstream of parse — the classification, the stats, the table — is untouched. That separation between “where the data came from” and “what we do with it” is itself a design lesson: keep the source at the edges and the logic pure in the middle.

Key takeaways

Look back at the program one more time. The ownership rules from lesson 4 decided who could touch the data. Borrowing from lesson 5 let the functions read without taking. Traits gave it a face to print; iterators gave it a pipeline; Result and ? gave it a way to fail honestly. Even concurrency was waiting in the wings.

It was all there, in one small file you can now read and write yourself. That was the whole point. The compiler will keep arguing with you — and every time, it’ll be teaching you something. Go build something.