Have you ever wondered how your brain predicts who's at the door before you even open it? This blog explores the concept of predictive models, using a simple doorbell scenario as an example. We'll dive into how data science can help us formalise this intuitive process, from collecting data to making predictions.
It’s early evening. You’re enjoying the warmth of a cup of tea and the sky is getting darker. From your window, flashes of a red coat against the grey street tell you little Kay from next door is playing outside. You start thinking about what to do for dinner. DING! Ah, someone’s at the door! Who can it be? You’re not expecting a package. Is Kay pranking you again?
Without realising it, you have just used a predictive model. You haven’t confirmed that someone is at the door yet; you just guessed it from the doorbell ring.
A model is a partial, limited representation of the world.
This becomes our Doorbell example case:
Now that we have formalised the General Doorbell Model, we can build on it to answer the pending question: who is at the door?
The unknown data is your visitor. The known data is: general knowledge about people who can show up on a doorstep, your contextual knowledge (time, day, weather, no package expected…), and your understanding of who has visited you before.
Now comes the tricky part: creating the model. Given the context, what are the logical connections between the current, unknown visitor and the known visitors of the past?
The choice of a model starts by understanding its limits. We already have constraints from the data provided. A doorbell ringing can be described as an event, and event prediction has an entire branch of maths devoted to it: statistics. That narrows the field, but there are still plenty of possible models.
I’m very tempted to do a “That’s all folks!” and call it a day. Choosing between those models deserves a full article of its own. For now, I will simplify the problem to get some simple predictions. These will be rough, unproven simplifications. Replicate this at home with the utmost caution!
We want to guess your visitor in particular, so let’s forget general knowledge and concentrate on your experience.
Here is the list of visitors who have rung the bell of the fictional house you reside in over the past six months.
Movers, Alice (neighbour), Bob (neighbour), Package delivery, Package, Internet people, Bob, Package, Charlie (neighbour), Dee (friend), Package, Postman, Dee, Group of friends, Postman, No one, Group of friends, Postman, Group of friends, Bob, Census people, Dee, Postman, Postman, Kay (neighbour), Kay, Plumber, Kay, Kay, Kay, Dee, Postman, Package, Kay, Group of friends, Kay, Kay, Jehovah's Witness, Kay, Dee, Alice, Kay, Kay, Package, Kay, Dee.
Categories and their number of occurrences, grouped so that no category has only one occurrence and the most frequent callers are singled out:
When those numbers are divided by the total number of occurrences, they give the probability that a random past visitor belongs to that category. With the model assumptions, that probability is the same as the probability of our current visitor belonging to that category.
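To make that division concrete, here is a minimal Python sketch of the frequency count. The visitor log is transcribed from the list above, with "Package delivery" counted as "Package". The grouping in `GROUPS` and the helper name `category_probabilities` are my own assumptions for illustration, and may not match the exact categories used in this article.

```python
from collections import Counter

# Visitor log transcribed from the list above (46 doorbell rings).
# Notes such as "(neighbour)" are dropped; "Package delivery" is counted as "Package".
visitors = [
    "Movers", "Alice", "Bob", "Package", "Package", "Internet people",
    "Bob", "Package", "Charlie", "Dee", "Package", "Postman", "Dee",
    "Group of friends", "Postman", "No one", "Group of friends", "Postman",
    "Group of friends", "Bob", "Census people", "Dee", "Postman", "Postman",
    "Kay", "Kay", "Plumber", "Kay", "Kay", "Kay", "Dee", "Postman",
    "Package", "Kay", "Group of friends", "Kay", "Kay", "Jehovah's Witness",
    "Kay", "Dee", "Alice", "Kay", "Kay", "Package", "Kay", "Dee",
]

# Hypothetical grouping (an assumption, not necessarily the article's own):
# frequent callers keep their own category, everyone unlisted falls into "Other".
GROUPS = {
    "Kay": "Kay",
    "Package": "Package/post",
    "Postman": "Package/post",
    "Dee": "Dee",
    "Group of friends": "Group of friends",
    "Alice": "Other neighbours",
    "Bob": "Other neighbours",
    "Charlie": "Other neighbours",
}


def category_probabilities(log):
    """Return {category: share of past rings}, most frequent category first."""
    counts = Counter(GROUPS.get(name, "Other") for name in log)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.most_common()}


probabilities = category_probabilities(visitors)
for category, p in probabilities.items():
    print(f"{category}: {p:.0%}")
```

Under this grouping, Kay and package/post deliveries each account for 12 of the 46 rings, which is about 26% apiece. A different grouping would shift the smaller categories but not those two.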
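Kay and package/post deliveries tie as the most frequent callers, each with 12 of the 46 past rings (about 26%). Together they account for roughly half of all visits, so our surprise caller is most likely one of the two, and about equally likely to be Kay or a delivery.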
This is quite a simple example, but it shows a classic Data Scientist thought process. Let’s break it down.
And voilà! A fictional doorbell event model based on data science.