Learn Creative Coding (#103) - Mini-Project: ML-Powered Interactive Installation

At the end of last episode I told you to bring your webcam, your mic, and an idea for a room you'd like to bring to life. So here we are. This is the one where we stop learning the instruments one by one and actually play a song with all of them at once.

For ten episodes now the network has learned new tricks. It learned to watch (96), to read bodies (93), hands (94), faces (95), to be trained by us with our own examples (100), to listen (101), and to measure how close two things are in meaning (102). Every one of those was a little island. Today we build the bridge between them and make something that feels alive in a room. I'm going to call it a "magic room" because that's genuinely what it feels like when it works - you walk in, you move, you make noise, you hold something up, and the walls answer.

This is portfolio work. By the end you'll have an interactive installation you designed, coded and tested yourself. That's a real thing you can show people. Allez, let me show you how I think about building one of these :-).

The shape of an installation

Before a single line of code, you need a mental picture of the pipeline. Every ML installation I've built has the same four layers, and once you see them you'll see them everywhere.

INPUT      ->   ML        ->   MAPPING    ->   RENDER
webcam          pose            "where you      generative
microphone      hands           stand drives    visuals
                classify        the center"     on screen
                audio

Input is your raw senses - pixels from the camera, samples from the mic. The ML layer turns that raw mess into meaning: a skeleton, a hand, a label, a loudness number. The mapping layer is where the art actually lives - it decides that your right hand becomes a brush, that loudness becomes brightness, that holding up a cup switches the whole mood. Render just draws the result.

Most beginners pour all their effort into render and treat mapping as an afterthought. It's backwards. The mapping layer is the installation. The same pose data can drive a calm breathing field or a violent particle storm - same input, completely different soul, and the difference is all in the mapping. Keep that in your head the whole way through.
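
To make that concrete, here is a minimal sketch of two mappings fed the same number. Both function names and every constant in them are made up for illustration; the only thing they share with the rest of this episode is the idea of an "energy" value between 0 and 1.

```javascript
// Two hypothetical mappings of the same ML output (an "energy" value in 0..1).
// Each returns plain render parameters; nothing here touches p5 or ml5.
function calmMapping(energy) {
  return {
    speed: 0.2 + energy * 0.3,      // barely speeds up
    size: 6 - energy * 2,           // shrinks slightly as you open up
    brightness: 60 + energy * 40    // stays dim
  };
}

function stormMapping(energy) {
  return {
    speed: 1 + energy * 8,          // accelerates hard
    size: 2 + energy * 10,          // swells
    brightness: 80 + energy * 175   // close to white at full energy
  };
}

// identical input, two very different sets of dials
const sameEnergy = 0.8;
const calmDials = calmMapping(sameEnergy);
const stormDials = stormMapping(sameEnergy);
console.log(calmDials, stormDials);
```

The renderer never needs to know which mapping produced the dials, which is why you can swap the soul of a piece without touching the drawing code.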

Wiring up the senses

Let's get the input layer running. We need the camera and the mic going at the same time, which sounds scary but p5 makes it gentle.

let video;
let mic;

function setup() {
  createCanvas(960, 540);

  // the eye: a webcam feed we hide and feed to the models
  video = createCapture(VIDEO);
  video.size(640, 360);
  video.hide();   // we draw our own visuals, not the raw feed

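  // the ear: one microphone, giving us a live amplitude reading
  mic = new p5.AudioIn();
  mic.start();
}

// browsers keep audio suspended until the visitor interacts with the page,
// so resume it on the first click or tap
function mousePressed() {
  userStartAudio();
}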

That's both senses open. The camera will feed every vision model we load, the mic gives us a constant stream of "how loud is it right now". Notice we hide the video - in an installation you almost never show the raw camera, that breaks the spell. The camera is a sensor, not a picture.

The base layer: your body is the structure

We start with pose, because the body is the broadest, most reliable signal. Where you stand, how big you are in frame, how you move - that's the skeleton (ha) the whole piece hangs on. We met pose detection back in episode 93, so this'll feel familiar.

let bodyPose;
let poses = [];

function preload() {
  bodyPose = ml5.bodyPose();   // the pose model from episode 93
}

// call this at the end of setup(), once `video` exists
function startPose() {
  // ml5 hands us an array of detected people, each with keypoints
  bodyPose.detectStart(video, results => {
    poses = results;
  });
}

Now the important bit, the mapping. I'm not going to draw a skeleton - that's a debug view, not art. Instead I'll pull two things out of the body that mean something: where the person is (I use the head as the anchor), and how spread-out they are. Those two become the heart of the visuals.

// turn a raw pose into two creative parameters:
// a center point (where attention goes) and an "energy" (how big/open)
function bodyToParams(pose) {
  let nose = pose.keypoints[0];          // head position
  let lWrist = pose.keypoints[9];        // left wrist
  let rWrist = pose.keypoints[10];       // right wrist

  // center of the visual = where the head is, mapped to canvas
  let cx = map(nose.x, 0, video.width, 0, width);
  let cy = map(nose.y, 0, video.height, 0, height);

  // "energy" = how far apart the wrists are. Arms wide = high energy.
  let spread = dist(lWrist.x, lWrist.y, rWrist.x, rWrist.y);
  let energy = map(spread, 40, 400, 0, 1, true);  // clamp to 0..1

  return { cx, cy, energy };
}

See what happened there? The messy 17-point skeleton became two ideas: a focus point and an energy level. That's the craft of mapping - boiling a rich signal down to a couple of expressive dials. Now I can drive a field of particles from those dials.

let field = [];   // particles that follow the body

function drawBodyField(p) {
  // p.cx, p.cy is where the body "is"; p.energy is 0..1
  for (let particle of field) {
    // pull every particle gently toward the body center
    let ax = (p.cx - particle.x) * 0.002;
    let ay = (p.cy - particle.y) * 0.002;
    particle.vx = (particle.vx + ax) * 0.96;
    particle.vy = (particle.vy + ay) * 0.96;
    particle.x += particle.vx * (1 + p.energy * 4);  // energy = speed
    particle.y += particle.vy * (1 + p.energy * 4);

    // energy also opens up the color and size
    let bright = 80 + p.energy * 175;
    noStroke();
    fill(bright, bright * 0.6, 200, 180);
    circle(particle.x, particle.y, 2 + p.energy * 6);
  }
}

Stand still and the particles gather quietly around you. Throw your arms open and they scatter fast and bright. The body is now the structure of the piece, exactly like we planned in that little pipeline diagram. This is the same particle thinking from episode 11, just steered by a human instead of by noise().
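
One thing the snippets above don't show is where the particles in `field` come from, since the array starts out empty. Here is a minimal sketch of a seeding helper. The name `makeField` and the particle count are my own assumptions; the only hard requirement is that each particle carries the `x`, `y`, `vx` and `vy` fields that drawBodyField reads.

```javascript
// Hypothetical helper: build `count` particles scattered over a w x h area.
// `rand` defaults to Math.random; pass your own for repeatable layouts.
function makeField(count, w, h, rand = Math.random) {
  const particles = [];
  for (let i = 0; i < count; i++) {
    particles.push({
      x: rand() * w,   // random start position
      y: rand() * h,
      vx: 0,           // at rest until the body pulls on them
      vy: 0
    });
  }
  return particles;
}

// in the sketch this would go at the end of setup(), for example:
// field = makeField(600, width, height);
const demoField = makeField(5, 960, 540);
console.log(demoField.length);
```

A few hundred particles is a comfortable starting point on a laptop; how far you can push it depends on your machine and on how many models are running next to it.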

The detail layer: hands add precision

Body gives you broad strokes. Hands (episode 94) give you fine control on top. The trick is to treat them as two different resolutions of the same person - the body paints, the hand draws. When a hand shows up, I let it do something delicate that the whole-body signal can't.

let handPose;
let hands = [];

// run the hand model alongside the pose model on the same video
// (call this from setup(), after createCapture)
function startHands() {
  handPose = ml5.handPose();
  handPose.detectStart(video, results => {
    hands = results;
  });
}

// a pinch = thumb tip and index tip close together
function pinchAmount(hand) {
  let thumb = hand.keypoints[4];
  let index = hand.keypoints[8];
  let d = dist(thumb.x, thumb.y, index.x, index.y);
  // small distance -> pinch ~1, open hand -> pinch ~0
  return map(d, 20, 120, 1, 0, true);
}

A pinch is one of those gestures that just feels right to people - everybody understands grabbing. So I'll let a pinch gather the particles into a tight knot, like you're pulling a drawstring. No instructions needed, visitors discover it in seconds.

function applyHands() {
  for (let hand of hands) {
    let index = hand.keypoints[8];
    let hx = map(index.x, 0, video.width, 0, width);
    let hy = map(index.y, 0, video.height, 0, height);
    let pinch = pinchAmount(hand);

    if (pinch > 0.5) {
      // pull nearby particles toward the fingertip, hard
      for (let particle of field) {
        let d = dist(particle.x, particle.y, hx, hy);
        if (d < 200) {
          particle.x += (hx - particle.x) * 0.2 * pinch;
          particle.y += (hy - particle.y) * 0.2 * pinch;
        }
      }
    }
  }
}

Body for the broad gesture, hand for the precise one. They layer instead of fighting because they map to different scales of behaviour. That layering idea is the secret to making an installation feel deep rather than gimmicky.
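
A caveat on the pinch itself: pinchAmount measures the thumb-index gap in video pixels, so its 20..120 range only really fits a hand at one distance from the camera. A hand far from the lens will tend to read as pinched all the time. One possible fix, sketched below, is to divide the gap by the size of the palm. The function name and the 0.2 / 1.0 thresholds are assumptions you would tune by eye, and I'm assuming the usual 21-point hand layout where keypoint 0 is the wrist and keypoint 9 is the base of the middle finger.

```javascript
// Assumption: `hand.keypoints` follows the 21-point hand layout used above
// (0 = wrist, 4 = thumb tip, 8 = index tip, 9 = base of the middle finger).
function pinchAmountScaled(hand) {
  const k = hand.keypoints;
  const gap = Math.hypot(k[4].x - k[8].x, k[4].y - k[8].y);
  const palm = Math.hypot(k[0].x - k[9].x, k[0].y - k[9].y);
  if (palm === 0) return 0;               // degenerate detection: treat as open
  const ratio = gap / palm;               // pinch gap measured in "palm lengths"
  // illustrative thresholds: ratio 0.2 -> fully pinched, 1.0 -> fully open
  const t = (ratio - 0.2) / (1.0 - 0.2);
  return 1 - Math.min(1, Math.max(0, t));
}
```

Because both distances shrink together as the hand moves away, the ratio stays roughly the same whether the visitor is one metre or three metres from the camera.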

The mood layer: sound sets the temperature

Now the ear. We don't need fancy speech recognition for the base mood - raw loudness already does a huge amount of emotional work. A quiet room should feel different from a loud one. We pull one number off the mic and let it set the overall temperature of the piece.

let smoothedLevel = 0;

function audioLevel() {
  let raw = mic.getLevel();          // 0..1ish, jumpy
  // smooth it so the visuals breathe instead of flickering
  smoothedLevel = lerp(smoothedLevel, raw, 0.15);
  return smoothedLevel;
}

That lerp smoothing is doing quiet heavy lifting - raw mic data is twitchy and ugly, and if you map it straight to brightness the whole screen strobes. Easing it (our friend from episode 16) turns a nervous signal into a breath. I learned that one the hard way at an exhibition where the unsmoothed version gave everyone a headache, oops.
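
If you want to see what that easing does to actual numbers, here is the same step written in plain JavaScript so it runs outside a sketch. `easeToward` is just my name for the formula behind lerp(prev, raw, amt), and the input series is invented: a silent room with one loud spike.

```javascript
// Same easing step as lerp(prev, raw, amt), written in plain JavaScript
// so the numbers can be checked outside a sketch.
function easeToward(prev, raw, amt) {
  return prev + (raw - prev) * amt;
}

// a silent room with one loud spike in the second frame
const rawLevels = [0, 1, 0, 0, 0];
let eased = 0;
const easedLevels = rawLevels.map(raw => {
  eased = easeToward(eased, raw, 0.15);
  return eased;
});
console.log(easedLevels.map(v => v.toFixed(3)));
```

With an amount of 0.15 the full-scale spike only lifts the smoothed level to about 0.15, and it then fades over the following frames instead of snapping back to zero. A higher amount reacts faster but flickers more, a lower one breathes slower.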

// fold loudness into the whole scene as a background "temperature"
function drawMood(level) {
  // louder room -> warmer, more saturated backdrop
  let warmth = map(level, 0, 0.3, 0, 1, true);
  let bg = lerpColor(color(8, 10, 30), color(60, 20, 25), warmth);
  background(bg);
}

For the sharp stuff - a clap, a shout - we can reuse the sound classification from episode 101 to fire discrete events instead of reading a continuous level. A detected clap becomes a flash, a trigger, a punctuation mark.

// a sound classifier (episode 101) firing one-shot events
function startSoundEvents(classifier) {
  classifier.classifyStart(result => {
    let top = result[0];
    if (top.label === 'clap' && top.confidence > 0.85) {
      triggerFlash();   // discrete bang, not a continuous level
    }
  });
}

So sound works on two timescales at once: a slow continuous level that sets the mood, and fast discrete events that punctuate it. Continuous for atmosphere, discrete for drama. That two-speed thinking applies to almost every signal in an installation.
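
The snippet above calls triggerFlash() without defining it, so here is one possible version, not the only way to do it. The decay factor and the cooldown length are assumptions. The cooldown is there because a continuously running classifier may report the same clap several times in a row, and you want one bang per clap.

```javascript
// Hypothetical flash state: `flashLevel` jumps to 1 on a trigger and decays.
let flashLevel = 0;
let flashCooldown = 0;   // frames left before another flash may fire

function triggerFlash() {
  if (flashCooldown > 0) return false;   // ignore repeats of the same clap
  flashLevel = 1;
  flashCooldown = 20;                    // assumed: about a third of a second at 60 fps
  return true;
}

// call once per frame from draw(); returns the current flash strength 0..1
function updateFlash() {
  flashLevel *= 0.85;                    // fast decay so the flash reads as a bang
  if (flashLevel < 0.01) flashLevel = 0; // snap to zero once it is invisible
  if (flashCooldown > 0) flashCooldown--;
  return flashLevel;
}
```

In draw() you would read the value from updateFlash() and use it as the alpha of a white rectangle over the whole canvas, or as a multiplier on particle brightness.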

The wildcard layer: show it something

Here's the layer that makes people gasp. Hold an object up to the camera, and the whole visual style changes. We use image classification (episode 96) so the installation recognises what you're showing it and reacts to the category.

let classifier;
let currentObject = null;

function startClassifier() {
  classifier = ml5.imageClassifier('MobileNet');
  classifyLoop();
}

function classifyLoop() {
  classifier.classify(video, (results) => {
    let top = results && results[0];
    // only switch mode on a confident guess
    if (top && top.confidence > 0.6) {
      currentObject = top.label;
    }
    classifyLoop();   // keep going, frame after frame
  });
}
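
The confidence check filters out weak guesses, but a label can still flip from one frame to the next. If you also want the guess to be stable, one option is a small debouncer that only accepts a label after it has won several times in a row. This is a sketch under my own assumptions: `makeStableLabel` and its default of 5 repeats are not part of ml5.

```javascript
// Hypothetical debouncer: a label only becomes "current" after it has been
// the confident top guess `needed` times in a row.
function makeStableLabel(needed = 5, minConfidence = 0.6) {
  let candidate = null;   // label we are currently counting
  let streak = 0;         // how many times in a row we have seen it
  let stable = null;      // last label that survived the streak test

  return function update(label, confidence) {
    if (confidence < minConfidence) {
      streak = 0;         // a shaky guess breaks the streak
      candidate = null;
      return stable;
    }
    if (label === candidate) {
      streak++;
    } else {
      candidate = label;
      streak = 1;
    }
    if (streak >= needed) stable = label;
    return stable;
  };
}
```

In classifyLoop you would create one debouncer up front and write its return value into currentObject, for example `currentObject = stableObject(top.label, top.confidence)` where `stableObject` is the function returned by makeStableLabel().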

Now I map broad categories of object to whole visual modes. I don't need every label - I just bucket them into a handful of moods. Show something blue and watery, get a cool flowing mode. Show something with a face or a creature, get an organic mode.

// collapse hundreds of MobileNet labels into a few visual moods
function objectToMode(label) {
  if (!label) return 'default';
  let l = label.toLowerCase();
  if (l.match(/water|ocean|fountain|bottle/)) return 'flow';
  if (l.match(/flower|leaf|tree|plant/))      return 'bloom';
  if (l.match(/book|paper|envelope/))         return 'text';
  return 'default';
}

This is where the embeddings idea from last episode (102) would level you up, by the way - instead of brittle keyword matching you'd compare the image's embedding against a few reference vectors and pick the nearest mood. Same concept, smoother result. But keyword buckets are a perfectly good place to start, and you can ship them today.
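
A rough sketch of what that nearest-mood lookup could look like. It assumes you already have an embedding for the current frame from somewhere (how you get it is not shown here), and the three-number reference vectors are toy values made up for illustration. Real embeddings have hundreds of dimensions. `nearestMood` and `minSimilarity` are hypothetical names.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  if (na === 0 || nb === 0) return 0;    // a zero vector has no direction
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Pick the mood whose reference vector is closest; fall back to 'default'
// when nothing is similar enough. `minSimilarity` is an assumed cutoff.
function nearestMood(embedding, references, minSimilarity = 0.5) {
  let best = 'default';
  let bestScore = minSimilarity;
  for (const [mood, ref] of Object.entries(references)) {
    const score = cosineSimilarity(embedding, ref);
    if (score > bestScore) {
      best = mood;
      bestScore = score;
    }
  }
  return best;
}

// toy 3-dimensional vectors, made up for illustration only
const moodRefs = {
  flow:  [1, 0, 0],
  bloom: [0, 1, 0],
  text:  [0, 0, 1]
};
console.log(nearestMood([0.9, 0.2, 0.1], moodRefs));   // flow
```

The nice property compared to keyword buckets is the fallback: anything that isn't close to any reference lands in 'default' instead of needing its own regex.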

The conductor: a state machine of scenes

With four input layers feeding in, you need something deciding what the piece is at any moment. That's a state machine - we built these back in episode 17, and an installation is honestly the perfect home for one. Each state is a "scene" with its own personality.

let scene = 'ambient';
let sceneTimer = 0;

function updateScene(params, level) {
  sceneTimer++;

  // transitions: sustained energy or a recognised object can switch scenes
  if (scene === 'ambient' && params.energy > 0.7 && sceneTimer > 60) {
    scene = 'energetic';
    sceneTimer = 0;
  } else if (scene === 'energetic' && params.energy < 0.2 && sceneTimer > 120) {
    scene = 'ambient';
    sceneTimer = 0;
  }

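  // an object override can drop us into a special intimate mode,
  // and we settle back to ambient once the object has been gone for a while
  let mode = objectToMode(currentObject);
  if (mode === 'bloom' && scene !== 'intimate') {
    scene = 'intimate';
    sceneTimer = 0;
  } else if (scene === 'intimate' && mode !== 'bloom' && sceneTimer > 120) {
    scene = 'ambient';
    sceneTimer = 0;
  }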
}

The timers matter. Without them the piece flickers between scenes every time you twitch, which feels broken. A scene should commit for a moment - sustained energy moves you somewhere, a single jerk doesn't. That patience is what separates a toy from an installation. Makes sense, right?
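
A small honesty note on the timers: sceneTimer counts frames since the scene was entered, so once it is past the threshold, one high-energy frame is enough to switch. If you want the energy itself to be sustained, a hold counter is one way to get there. This is a sketch, and `makeHold` and the 30-frame figure are my own assumptions.

```javascript
// Hypothetical helper: counts consecutive frames in which a condition held.
// The returned function answers true only once the condition has been true
// for `frames` calls in a row.
function makeHold(frames) {
  let count = 0;
  return function held(condition) {
    count = condition ? count + 1 : 0;   // any false frame resets the count
    return count >= frames;
  };
}

// assumed usage: half a second of open arms at 60 fps before going energetic
const highEnergyHeld = makeHold(30);
// inside updateScene you might then write:
// if (scene === 'ambient' && highEnergyHeld(params.energy > 0.7)) { ... }
```

Call it exactly once per frame, otherwise the count runs faster than the frame rate and the hold gets shorter than you think.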

function draw() {
  let level = audioLevel();
  drawMood(level);                     // repaint every frame, even with nobody in frame

  let pose = poses[0];
  if (!pose) { drawIdle(); return; }   // nobody home, breathe gently

  let params = bodyToParams(pose);
  updateScene(params, level);

  // each scene renders the shared signals its own way
  if (scene === 'ambient')   drawBodyField(params);
  if (scene === 'energetic') { drawBodyField(params); applyHands(); }
  if (scene === 'intimate')  drawIntimate(params);
}

One body of input, three different faces depending on the scene. The state machine is the conductor telling the orchestra which piece to play.

Designing for failure (this is the real lesson)

Here's the part the tutorials never tell you. ML models misfire. Pose detection loses you when you turn sideways. The classifier suddenly decides your hand is a "bath towel". The mic picks up a truck outside. In a demo on your laptop you just shrug. In an installation running for eight hours with strangers walking through, every misfire is visible and your piece looks broken.

So you design for graceful degradation. The rule: when a signal disappears, the visuals should soften, never crash or freeze.

function drawIdle() {
  // no body detected: don't freeze, don't error - just breathe
  // slowly drift the existing particles and fade toward calm
  for (let particle of field) {
    particle.x += sin(frameCount * 0.01 + particle.y) * 0.3;
    particle.y += 0.2;
    if (particle.y > height) particle.y = 0;
    noStroke();
    fill(60, 70, 120, 120);
    circle(particle.x, particle.y, 2);
  }
}

When pose tracking drops, we don't throw - we fall into this gentle ambient drift. A visitor who steps out of frame sees the room sigh and settle, not a stack trace. When they step back in, the body field picks them up again. The seam is invisible. That single design choice is the difference between "cool student project" and "thing that can actually run unattended in a gallery".

Guarding every model read the same way is just good hygiene:

// never trust that a keypoint exists - models return partial data
function safeKeypoint(pose, index) {
  let kp = pose && pose.keypoints && pose.keypoints[index];
  if (!kp || kp.confidence < 0.3) return null;   // low confidence = ignore
  return kp;
}

Low-confidence keypoints are worse than missing ones, because they jitter convincingly. Throwing them away below a confidence floor is what keeps the whole piece calm.
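
Here is how that guard could be folded into the body mapping from earlier. Treat it as a sketch: `safeBodyParams` and the idea of passing in the previous result are my assumptions, and the guard is a slightly stricter copy of safeKeypoint (a keypoint with no confidence value at all counts as untrusted). The math is the same as bodyToParams, written without p5 helpers so it can run on its own.

```javascript
// Stricter copy of the safeKeypoint idea, inlined so this block runs alone.
function confidentKeypoint(pose, index, floor = 0.3) {
  const kp = pose && pose.keypoints && pose.keypoints[index];
  if (!kp || !(kp.confidence >= floor)) return null;
  return kp;
}

// Hypothetical guarded variant of bodyToParams. `prev` is the last good result;
// it is returned unchanged whenever the head cannot be trusted this frame.
function safeBodyParams(pose, prev, videoW, videoH, canvasW, canvasH) {
  const nose = confidentKeypoint(pose, 0);
  if (!nose) return prev;                        // no head: hold the last center

  const cx = (nose.x / videoW) * canvasW;
  const cy = (nose.y / videoH) * canvasH;

  const lWrist = confidentKeypoint(pose, 9);
  const rWrist = confidentKeypoint(pose, 10);
  let energy = prev ? prev.energy : 0;           // wrists hidden: keep old energy
  if (lWrist && rWrist) {
    const spread = Math.hypot(lWrist.x - rWrist.x, lWrist.y - rWrist.y);
    energy = Math.min(1, Math.max(0, (spread - 40) / (400 - 40)));
  }
  return { cx, cy, energy };
}
```

Holding the last good value is a design choice, not a rule. It keeps the piece from snapping to zero energy when a wrist slips behind someone's back, at the cost of reacting a beat late when the arms really do drop.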

When the room fills up

The last bit of magic: the installation should scale with the audience. One person is a solo. Two people, draw a line of light between their bodies - suddenly it's a duet and strangers start playing with each other. Five people and you let the group dynamics emerge.

function drawConnections() {
  // a glowing thread between every pair of detected bodies
  for (let i = 0; i < poses.length; i++) {
    for (let j = i + 1; j < poses.length; j++) {
      let a = poses[i].keypoints[0];   // head of person i
      let b = poses[j].keypoints[0];   // head of person j
      let ax = map(a.x, 0, video.width, 0, width);
      let ay = map(a.y, 0, video.height, 0, height);
      let bx = map(b.x, 0, video.width, 0, width);
      let by = map(b.y, 0, video.height, 0, height);
      stroke(200, 180, 255, 140);
      strokeWeight(2);
      line(ax, ay, bx, by);
    }
  }
}

That's maybe fifteen lines, and it completely changes the social feel of the room. People who came in separately end up laughing and waving their arms to make the threads dance. You didn't program that interaction - it emerged from one simple rule, which is the whole spirit of this series.

The exercise: build your room

Allez, here's your assignment, and it's a real one. Build a simplified version of this with at least two ML inputs - pose plus sound, or pose plus classification, your pick. Design three visual modes and make the state machine transition between them. Test it with more than one person in frame. And critically: handle failure gracefully, so when tracking drops the piece softens instead of breaking.

Then document it. Record a short video of it responding to you - this is the single most portfolio-worthy thing you'll make in the whole series, and a video of a room reacting to a human is worth a hundred screenshots. You designed it, you coded it, you tested it with real people. That's not a tutorial exercise anymore, that's a piece.

If you want a stretch goal: swap the brittle keyword buckets for embedding-based mood matching (102), or train a custom gesture classifier in Teachable Machine (100) so the room responds to poses you invented. The toolbox is all there now. The only limit left is what room you want to bring to life.

't Komt erop neer... (what it all comes down to)

Every ML installation has the same four layers: INPUT (camera, mic) -> ML (pose, hands, classify, audio) -> MAPPING (turning ML output into creative dials) -> RENDER. The mapping layer is where the art actually lives - the same pose data can drive a calm field or a violent storm
The body (pose, episode 93) is your base structure: boil the 17-point skeleton down to a couple of expressive parameters - a focus point and an "energy" level - and drive the visuals from those, not from a literal skeleton
Hands (94) add a finer resolution on top of the body. Body paints broad strokes, hand does precise gestures like a pinch-to-gather. They layer instead of fighting because they map to different scales
Sound works on two timescales: a smoothed continuous loudness sets the mood/temperature, while discrete classified events (clap, shout - episode 101) punctuate it. Always smooth raw mic data with lerp or your screen strobes
Image classification (96) is the wildcard: recognise objects held to the camera and bucket hundreds of labels into a few visual "modes". Embeddings (102) make this smoother than keyword matching
A state machine (episode 17) is the conductor - each scene is a personality, and transitions need timers so the piece commits to a mood instead of flickering on every twitch
Design for failure. Models misfire constantly; when a signal drops, the visuals must soften (a gentle idle drift), never crash or freeze. Guard every keypoint read and throw away low-confidence data. This is what lets a piece run unattended
Scale with the audience: one rule that draws threads between detected bodies turns solo visitors into a playing crowd. The social magic emerges from simple rules
The exercise is portfolio-grade: build an installation with two ML inputs, three modes, graceful failure, tested with multiple people, documented on video

And that's the ML arc landing. We spent ten episodes teaching a network to watch, read, classify, generate, listen and measure - and today we put the whole toolbox on one table and built a living room out of it. Notice something though: the moment the machine is this deep in the creative loop, a question starts knocking. When the room makes a choice you didn't explicitly program, whose work is it - yours, the model's, the people in the room? That's not a coding question, it's a harder one, and it's exactly where we're headed next. We've also been stuck behind glass this whole series, everything happening on a screen - I think it's about time we let the code out into the physical world.

Sallukes! Thanks for reading.

@femdev