Three seconds of audio is enough. That is not a marketing line — it is the figure Microsoft's own researchers published in their 2023 VALL-E paper. Here is what those three seconds actually let a criminal do, where they come from, and how to do a five-minute audit of your household's public audio.
What three seconds of audio can actually do
The "three seconds" figure is not folklore. Microsoft Research's VALL-E paper, published in 2023, describes a system that can "synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt" (see the public paper at arxiv.org/abs/2301.02111). That sentence is the benchmark every consumer voice-cloning service has been built against.
What three seconds gets a criminal: timbre, pitch contour, and the basic prosodic envelope of a target voice — enough to generate arbitrary spoken sentences in that voice from typed text. The clone is not perfect. A trained ear in a quiet room can sometimes catch a slight metallic edge, breathing pauses in the wrong places, vowel transitions that do not match the speaker's normal pattern. The problem is that almost no one is in a quiet room with a trained ear when the call comes. The Federal Trade Commission's consumer alert on AI-enhanced family emergency scams puts it plainly: "All [the scammer] needs is a short audio clip of your family member's voice — which he could get from content posted online — and a voice-cloning program. When the scammer calls you, he'll sound just like your loved one" (consumer.ftc.gov).
Where criminals find three seconds of your voice
Mostly: places families do not think of as "voice databases."
- Public TikTok videos and Instagram Reels of children, grandchildren, and the parent themselves
- YouTube uploads of wedding speeches, graduation toasts, retirement parties
- Podcast appearances, even short interviews on local community shows
- Public Zoom recordings — class reunions, community-board meetings, church services
- Voicemail greetings on numbers that show up in data breaches
- Sermons, eulogies, and political-campaign volunteer appearances
- Any livestream where the speaker is identified by name
The scale of the fraud this audio feeds is not a guess. The FBI's Internet Crime Complaint Center recorded 101,068 complaints from victims age 60 and over in 2023, with reported losses of $3.4 billion and 5,920 victims who lost more than $100,000 each (2023 IC3 Elder Fraud Report). The grandparent and family-emergency variants run on exactly the audio anyone can scrape off a public profile in an afternoon.
What a three-second clone sounds like next to the original
Audio comparison is the most persuasive way to internalize this, which is why we keep a small set of synthetic-voice samples in the Resources library. The general pattern: a listener who has been told to listen for the clone will usually catch it. A listener answering an unexpected call at dinnertime almost never does.
That is the asymmetry the scam exploits. A family that has been briefed in advance is much harder to fool. The briefing is the product.
How to audit your own public audio in five minutes
The goal is not zero footprint — that is unrealistic and not the point. The goal is to know what is out there and to remove or restrict the highest-risk clips.
- Search your name on the major platforms.
TikTok, Instagram Reels, YouTube Shorts, YouTube long-form, Facebook video. Search your name, your children's names, and your parents' names. Set aside a few minutes.
- Note every clip where any family member's voice is audible for three seconds or more.
Most families find more than they expected. Wedding speeches, graduations, and tribute videos are the three largest categories.
- Set the highest-risk clips to friends-only, or remove them.
If a clip belongs to someone outside the immediate family — a venue, a photographer, a friend's account — ask them. Most people say yes once you explain why.
- Repeat for podcasts, public Zoom recordings, and livestream archives.
Local-community recordings tend to be the long tail. Church services and HOA meetings are the most-overlooked sources.
- Replace any clip you cannot delete with a captioned silent version.
For tribute videos and similar, a captioned re-cut preserves the moment without the voice. Most platforms allow an edit that swaps the audio track.
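If it helps to keep the audit in one place, the steps above can be sketched as a small log that flags which clips need attention. This is a minimal sketch, not a tool we ship: the `Clip` fields, the `next_step` and `audit` helpers, and the sample entries are all made up for illustration, and the three-second threshold is simply the figure used throughout this piece.

```python
from dataclasses import dataclass

# Threshold taken from the article: three seconds of audible voice.
MIN_RISKY_SECONDS = 3.0


@dataclass
class Clip:
    """One entry in a hypothetical household audit log."""
    platform: str         # e.g. "YouTube"
    description: str      # what the clip is
    speaker: str          # which family member is audible
    voice_seconds: float  # rough estimate of audible speech
    public: bool          # True if anyone can view it
    we_control: bool      # True if it sits on a family-owned account


def next_step(clip: Clip) -> str:
    """Map one clip to the action the audit steps suggest."""
    if not clip.public or clip.voice_seconds < MIN_RISKY_SECONDS:
        return "keep"
    if clip.we_control:
        return "restrict or remove"
    return "ask the owner"


def audit(clips: list[Clip]) -> list[tuple[Clip, str]]:
    """Return flagged clips with their suggested action, longest voice first."""
    flagged = [(c, next_step(c)) for c in clips if next_step(c) != "keep"]
    return sorted(flagged, key=lambda pair: pair[0].voice_seconds, reverse=True)


if __name__ == "__main__":
    # Invented sample entries, for illustration only.
    log = [
        Clip("YouTube", "wedding speech", "Grandpa", 95.0, True, False),
        Clip("Instagram", "birthday reel", "Maya", 12.0, True, True),
        Clip("Facebook", "tribute video", "Grandma", 40.0, False, True),
        Clip("TikTok", "dance clip, no speech", "Maya", 1.5, True, True),
    ]
    for clip, action in audit(log):
        print(f"{clip.platform}: {clip.description} ({clip.speaker}) -> {action}")
```

Sorting by voice length is only one way to rank risk. A family might reasonably put clips where the speaker is named at the top instead.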
The defense is not zero footprint. The defense is a question only your family knows.
What if it is already too late to scrub
Mostly, it is too late to scrub. Most families have decades of public audio. The voice is out there.
That is fine. Cleanup helps at the margin; the load-bearing defense was never going to be cleanup. The load-bearing defense is a question your family agrees on, paired with a private number that reaches your family. Even if a perfect clone of every family member's voice already exists, the clone cannot answer the question, and the hotline rings the family members you chose — not whoever spoofed the inbound caller ID.
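For readers who think in code, the logic of the family question can be sketched in a few lines. This is an illustration of the idea under our own assumptions, not a description of how any product works: `caller_verified`, `normalize`, and the sample answer "blue heron" are hypothetical. The point is what the function does not take as input, namely how the voice sounds.

```python
from typing import Optional


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so small slips do not fail a real relative."""
    return " ".join(answer.lower().split())


def caller_verified(spoken_answer: Optional[str], family_answer: str) -> bool:
    """Return True only if the caller gave the agreed answer.

    Voice quality is deliberately not a parameter: a perfect clone
    and a real relative are judged on the answer alone.
    """
    if spoken_answer is None:  # caller dodged or refused the question
        return False
    return normalize(spoken_answer) == normalize(family_answer)


if __name__ == "__main__":
    agreed = "blue heron"  # hypothetical answer the family chose in advance
    print(caller_verified("  Blue   Heron ", agreed))
    print(caller_verified("There's no time, just send the money", agreed))
    print(caller_verified(None, agreed))
```

A caller who cannot or will not answer is treated the same as one who answers wrongly: hang up and call back on the number your family agreed on.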
If a call has already happened, file a report at reportfraud.ftc.gov and a second one at ic3.gov. The AARP Fraud Watch Network helpline (1-877-908-3360, staffed by trained volunteers) is worth saving in your contacts at the same time (aarp.org/money/scams-fraud/helpline). We walk through the rest of the response in our first-hour piece.
Where to go next
The Resources library has the audio samples, the fridge card template, and the printable first-hour checklist. When you want to make the kitchen-table conversation a thing the household has actually had, The Family Word kit is $59 with free US shipping, and the hotline rings the family members you choose the day it arrives.