The 3 AM Test: Building a Tailings Management System That Works When Nobody’s Watching
It’s 3:17 AM on a Sunday. The operations center is empty except for the night shift supervisor. He’s been checking his phone every ten minutes for the last hour, watching the storm tracking app. The rain started at midnight. Already 80 mm and still coming down hard. The weather forecast missed this one; it was supposed to be a routine system.

At 3:22 AM, his phone buzzes. Automated alert: “Piezometer P-14 exceeded Yellow threshold. Reading: 24.3 kPa. Previous: 19.1 kPa. Rate of rise: 5.2 kPa in 90 minutes.”

He opens the monitoring dashboard. P-14 is in the critical zone on the south face. The TARP is clear: a Yellow threshold requires RTFE notification and increased monitoring frequency. But it’s 3 AM on a Sunday. The RTFE is at home, probably asleep. The superintendent is off rotation. The EOR is in another city.

He has two choices:

Choice 1: Wake people up, follow the TARP, look like he’s overreacting to a rainstorm.

Choice 2: Monitor it himself and wait until the morning shift to report it. Maybe it’ll stabilize. No need to panic.

What happens next reveals whether you have a Tailings Management System or just tailings management documents.
GISTM Requirement 8.2 mandates: “Establish a tailings governance framework and a performance based TMS.” But here’s what that requirement doesn’t tell you: the true test of your TMS isn’t during normal operations, scheduled meetings, or when leadership is watching. It’s at 3 AM. During shift changes. On holidays. When experienced people are on vacation. When unusual conditions arise. When nobody’s looking.

That’s when you discover whether you have systems that work, or systems that exist on paper.

What “System” Actually Means (Most People Get This Wrong)

What a system is NOT:
- A binder full of procedures
- An organizational chart
- A software platform
- A set of requirements
- A compliance checklist
What a system IS:
- Interconnected components that work together
- Feedback loops that enable self-correction
- Redundancy, so single failures don’t cascade
- Forcing functions that make the right thing easy and the wrong thing hard
- Resilience to absorb unexpected conditions
- Learning capability that improves over time
The difference:

Document: “In the event of a Yellow TARP trigger, the RTFE shall be notified within 24 hours.”

System: An automated alert sends an SMS to the RTFE’s phone and a backup phone, creates a ticket in the management system with a timer, escalates automatically if not acknowledged within 2 hours, logs the notification for the audit trail, and feeds similar alerts into pattern recognition to check whether multiple parameters are showing concerns.

See the difference? The document tells you what should happen. The system makes it happen.

The Six Characteristics of Systems That Work at 3 AM

Characteristic 1: Default to Safe (Not Default to Convenient)

Systems that fail at 3 AM make the wrong choice the easy choice. Example:
- The TARP requires immediate notification
- But notification requires finding phone numbers, waking people up, explaining the situation
- Easy choice: Wait until morning
- Safe choice: Notify now
Systems that work at 3 AM make the safe choice the easy choice. What this looks like (a code sketch follows the list):
- Single button: “Trigger Yellow TARP Response”
- The button automatically notifies all required people via multiple channels (SMS, call, app notification)
- Creates an incident record with a timestamp
- Displays recommended immediate actions on screen
- Provides a template for the situation report
- Connects to relevant monitoring data automatically
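To make the single button concrete, here’s a minimal sketch of such a trigger workflow in Python. Everything named here is illustrative: the channel senders (send_sms, place_call, push_app_alert) are placeholders for whatever gateways a site actually uses, and the recommended actions are examples rather than a real TARP.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Incident:
    """Audit-ready record created the moment the button is pressed."""
    trigger: str
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    notifications: list = field(default_factory=list)

# Placeholder channel senders: stand-ins for a site's real SMS/voice/app gateways.
def send_sms(number, msg): print(f"SMS  -> {number}: {msg}")
def place_call(number, msg): print(f"CALL -> {number}: {msg}")
def push_app_alert(user, msg): print(f"APP  -> {user}: {msg}")

RECOMMENDED_ACTIONS = [
    "Increase monitoring frequency",
    "Verify adjacent instruments",
    "Prepare situation report (template opens automatically)",
]

def trigger_yellow_tarp(reading: str, contacts: list) -> Incident:
    """One button: notify everyone on every channel, log it, show next steps."""
    incident = Incident(trigger=f"Yellow TARP: {reading}")
    msg = f"Yellow TARP at {incident.created_at}: {reading}"
    for c in contacts:
        send_sms(c["sms"], msg)
        place_call(c["phone"], msg)
        push_app_alert(c["user"], msg)
        incident.notifications.append(c["user"])  # audit trail
    print("Recommended immediate actions:")
    for step in RECOMMENDED_ACTIONS:
        print(f"  - {step}")
    return incident

trigger_yellow_tarp("P-14 at 24.3 kPa, rising 5.2 kPa / 90 min",
                    [{"user": "rtfe", "sms": "+0000000", "phone": "+0000000"}])
```

The point of the design isn’t the code; it’s that one action fans out to every required notification and leaves an audit-ready record without the operator doing anything extra.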
Now the easy choice and the safe choice are the same.

Real example from a mine in Chile:

Before system redesign:
- TARP trigger required the operator to call the RTFE (phone number in a binder)
- Then call the superintendent
- Then document in the logbook
- Then send an email summary
- Average time to complete notifications: 45 minutes (if done immediately)
- Reality: Often delayed until shift change or day shift
After system redesign:
- Monitoring system automatically detects the TARP trigger
- Sends notifications without operator action
- But the operator still has a role: confirm the situation, add observations, indicate any immediate actions taken
- System presents the operator with a simple interface: What do you see? What have you done?
- Average time from trigger to notifications complete: 90 seconds
The result: The operator’s job became easier (less work) while safety improved (faster response, no missed notifications).

Characteristic 2: Visible Accountability (Everyone Knows Who Does What)

Systems that fail at 3 AM have ambiguous accountability. Example questions operators can’t answer:
- “Who do I call first?”
- “What if the RTFE doesn’t answer?”
- “Do I call the EOR or does the RTFE do that?”
- “Can I make this decision or do I need approval?”
When accountability is unclear, people hesitate or defer.

Systems that work at 3 AM have crystal-clear accountability.

Real example from an accountability matrix (simplified):

| Situation | Immediate Action | Who Decides | Who Must Be Notified | Who Can Stop Operations |
| --- | --- | --- | --- | --- |
| Yellow TARP | Increase monitoring | Supervisor | RTFE (within 2 hrs) | RTFE |
| Orange TARP | Implement operational restrictions | RTFE | RTFE + Superintendent + EOR | RTFE or above |
| Red TARP | Stop operations immediately | Anyone | Everyone + Emergency response | Anyone |
| Equipment failure affecting monitoring | Switch to backup/manual | Supervisor | Maintenance + RTFE | N/A (operations continue) |
| Visible seepage at toe | Stop deposition, document, investigate | Supervisor | RTFE immediately | RTFE |

This matrix is:
- Visible (posted in the control room, on mobile devices, in field offices)
- Specific (no ambiguity)
- Empowering (people know they have the authority to act)
And here’s the key: the matrix explicitly says “Anyone” can trigger a Red TARP. The message: if you think it’s an emergency, you’re empowered to call it. You don’t need permission to be concerned.

Characteristic 3: Forcing Functions (Making Wrong Choices Difficult)

Forcing functions are system features that prevent or flag dangerous actions.

Example from aviation: Aircraft doors are designed so they can’t be opened during flight (the pressure differential makes it physically impossible). You can’t accidentally do the wrong thing.

Tailings management equivalent:

Scenario: Deposition operations approaching minimum freeboard

System without forcing function:
- Operations continue until someone notices the freeboard is inadequate
- Relies on human vigilance
System with forcing function:
- Monitoring system tracks freeboard continuously
- At 2.5 m freeboard: Yellow alert displays on operations screens (“Approaching minimum freeboard”)
- At 2.0 m freeboard (minimum): Orange alert + automatic notification to the RTFE + recommendation to suspend deposition
- At 1.5 m freeboard (critical): Red alert + operations must acknowledge the alert and document justification to continue (this tiered logic is sketched in code below)
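A minimal sketch of that tiered logic, using the thresholds from the example; the function shape and alert wiring are assumptions for illustration.

```python
def assess_freeboard(freeboard_m: float) -> dict:
    """Map a freeboard reading to an alert level and required action."""
    if freeboard_m <= 1.5:
        return {"level": "RED",
                "action": "Acknowledge alert and document justification to continue"}
    if freeboard_m <= 2.0:
        return {"level": "ORANGE",
                "action": "Auto-notify RTFE; recommend suspending deposition"}
    if freeboard_m <= 2.5:
        return {"level": "YELLOW",
                "action": "Display 'Approaching minimum freeboard' on ops screens"}
    return {"level": "OK", "action": "No action required"}

for reading in (3.1, 2.4, 1.9, 1.4):
    status = assess_freeboard(reading)
    print(f"{reading} m -> {status['level']}: {status['action']}")
```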
The forcing function: the system makes it hard to continue operations without acknowledging you’re in a non-compliant state.

Real example from a gold mine in Nevada: They implemented “Design Intent Verification” forcing functions.

Beach slope monitoring:
- Target slope: 1-2%
- Drone survey calculates actual slopes weekly
- If any area exceeds 3%: System flags it and operations receives a notification
- Forcing function: Deposition can’t proceed in that area until either the slope is corrected or engineering approval is obtained and documented (sketched below)
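A sketch of that weekly check, assuming a survey that reports an average slope per area and a set of documented engineering approvals (both illustrative):

```python
def blocked_areas(slopes_pct: dict, approvals: set, limit: float = 3.0) -> list:
    """Return survey areas where deposition must pause: over the limit and
    without a documented engineering approval."""
    return [area for area, slope in slopes_pct.items()
            if slope > limit and area not in approvals]

weekly_survey = {"cell_A": 1.8, "cell_B": 3.4, "cell_C": 2.1}
print(blocked_areas(weekly_survey, approvals=set()))        # ['cell_B']
print(blocked_areas(weekly_survey, approvals={"cell_B"}))   # []
```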
Result: Before implementation, beach slopes occasionally reached 5-6% (concerning from a stability perspective) before being corrected. After implementation, slopes never exceeded 3.5%, because the system caught deviations and forced correction early.

Characteristic 4: Redundancy (No Single Points of Failure)

Systems that fail at 3 AM rely on single individuals or components. Scenarios of single-point failures:
- Only the RTFE understands the TARPs (what happens when the RTFE is on vacation?)
- Only one person knows how to interpret specific monitoring data (what if they’re unavailable?)
- Critical information exists in one location (what if that system fails?)
- A single communication channel (what if it’s down?)
Systems that work at 3 AM have redundancy.

Redundancy in people:
- Primary RTFE and a designated backup RTFE (cross-trained, both with authority)
- Multiple people trained on TARPs and response protocols
- Succession planning for key roles
- Documentation sufficient that qualified personnel can step in
Redundancy in technology:
- Primary monitoring system and a backup (different platforms if possible)
- Automated alerts via multiple channels (SMS + email + app + phone call)
- Data stored redundantly (local + cloud, multiple backups)
- Power backup for critical systems (generators, UPS)
Redundancy in communication:
- Multiple contact methods for key personnel (work phone + personal phone + email + home phone)
- Communication trees (if the primary person doesn’t respond, the system contacts the secondary)
- Physical backups (printed emergency contacts in the control room)
Real example from a mine in Indonesia:

2019 incident: The primary piezometer monitoring system failed during a weekend. The data logger corrupted. Nobody noticed until Monday, because there was no backup.

System redesign:
- Installed redundant monitoring (two independent systems reading the same instruments)
- Automated daily data validation (compares readings between systems and flags divergence; see the sketch after this list)
- Weekly manual verification checks (field personnel physically verify key instruments are functioning)
- Backup power systems with automatic switchover
- Redundant communication systems (cellular + satellite)
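The daily cross-validation step might look like this sketch; the 5% divergence tolerance is an assumed value, not from the source.

```python
def validate_readings(primary: dict, backup: dict, tol_pct: float = 5.0) -> list:
    """Return instruments whose two independent readings diverge beyond tol_pct."""
    flagged = []
    for tag, p in primary.items():
        b = backup.get(tag)
        if b is None:
            flagged.append((tag, "missing from backup system"))
        elif abs(p - b) / max(abs(p), 1e-9) * 100 > tol_pct:
            flagged.append((tag, f"divergence: {p} vs {b}"))
    return flagged

print(validate_readings({"P-7": 19.1, "P-14": 24.3},
                        {"P-7": 19.0, "P-14": 21.0}))
# [('P-14', 'divergence: 24.3 vs 21.0')]
```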
2023 test: The primary system had a hardware failure on a Friday night. The backup system continued operating, the failure was detected via automated validation, and the maintenance team was notified and had it fixed by Saturday afternoon. Operations continued safely throughout. No single point of failure meant the system degraded gracefully instead of failing catastrophically.

Characteristic 5: Feedback Loops (System Self-Corrects)

Systems that fail at 3 AM are open-loop:

Open-loop system:
- Input → Action → Output
- No verification that the output is correct
- No correction if the output is wrong
Example: The procedure says “conduct daily visual inspection.” A person does the inspection, completes the checklist, done. But: What if they missed something? What if conditions changed after the inspection? What if the checklist doesn’t capture what matters today?

Systems that work at 3 AM are closed-loop:

Closed-loop system:
Input → Action → Output → Measurement → Comparison to Expected → Correction if Needed
Example: Daily visual inspection, but with feedback:
- Inspection conducted, checklist completed
- Photos taken at key locations
- Photos automatically compared to previous days’ (AI flags changes)
- Supervisor reviews flagged changes
- If significant change: triggers an investigation
- Investigation results feed back into the inspection checklist (what should we look for in the future? a skeleton of this loop follows the list)
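A skeleton of that closed loop, with change_score as an inline stub standing in for whatever photo-comparison method a site actually uses (it is not a real library call):

```python
def change_score(previous_photo, current_photo) -> float:
    """Stub: return a 0-1 change score between two photos. A real system
    would use an actual image-comparison method here."""
    return 0.0 if previous_photo == current_photo else 0.6

def inspection_cycle(photos_today: dict, photos_yesterday: dict,
                     checklist: list, threshold: float = 0.3):
    """Act, measure, compare, correct: flagged findings feed back into
    tomorrow's checklist."""
    flagged = [loc for loc in photos_today
               if change_score(photos_yesterday.get(loc), photos_today[loc]) > threshold]
    checklist = checklist + [f"Re-examine {loc}" for loc in flagged]
    return flagged, checklist

flagged, checklist = inspection_cycle(
    {"south_toe": "img_0412"}, {"south_toe": "img_0391"},
    checklist=["Walk crest", "Check decant"])
print(flagged, checklist)
```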
The system learns and improves.

Real example of a feedback loop from a copper mine in Canada:

TARP response tracking system:
- Event: A TARP trigger occurs
- Action: Response protocol implemented
- Measurement: System tracks response time, actions taken, outcome (a metrics sketch follows this list)
- Analysis: Monthly review of all TARP triggers:
  - How quickly did we respond?
  - Were responses effective?
  - Did we follow protocols?
  - Were protocols appropriate?
- Correction: Update TARPs, training, and systems based on the analysis
- Verification: Track whether changes improved performance
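A sketch of the monthly analysis step, assuming each trigger is logged with a notification time and a protocol-adherence flag (the field names are illustrative):

```python
from statistics import mean

def monthly_tarp_review(events: list, target_hours: float = 2.0) -> dict:
    """Summarize every TARP trigger logged this month."""
    times = [e["notify_hours"] for e in events]
    return {
        "triggers": len(events),
        "avg_notify_hours": round(mean(times), 1),
        "within_target_pct": round(
            100 * sum(t <= target_hours for t in times) / len(times)),
        "protocol_followed_pct": round(
            100 * sum(e["protocol_followed"] for e in events) / len(events)),
    }

events = [{"notify_hours": 3.5, "protocol_followed": True},
          {"notify_hours": 0.4, "protocol_followed": True},
          {"notify_hours": 5.8, "protocol_followed": False}]
print(monthly_tarp_review(events))
# {'triggers': 3, 'avg_notify_hours': 3.2, 'within_target_pct': 33,
#  'protocol_followed_pct': 67}
```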
What they discovered through feedback analysis:

Discovery 1: Average time from Yellow trigger to RTFE notification was 3.2 hours (the target was 2 hours).

Root cause: Night shift operators were hesitant to wake the RTFE for “maybe nothing.”

Correction: Changed the protocol to automated notification plus operator-added context (“heavy rain ongoing, rising trend continues” vs “rain stopped, reading stabilizing”). The RTFE can assess the situation without the operator worrying about a false alarm.

Result: Average notification time dropped to 0.3 hours.

Discovery 2: Orange TARPs triggered three times in 6 months, each time during heavy rain. Piezometers spiked, then dropped back to normal within 48 hours as water drained.

Question: Are the thresholds too sensitive? Or is rapid response appropriate?

Analysis: The EOR reviewed the events and determined the thresholds were appropriate for detecting potential issues, while the rapid reversals indicated the drainage was functioning well. These were not false alarms; the system was working as designed.

Correction: No threshold change, but enhanced training for operations: “These spikes during storms are expected. We respond because it’s prudent, not because we think failure is imminent. If piezometers DON’T drop back after the rain stops, that’s the real concern.”

Result: Operations staff better understood the system, with less alarm fatigue and appropriate vigilance maintained.

The feedback loop turned data into learning.

Characteristic 6: Resilience (System Works Under Stress)

Systems that fail at 3 AM are brittle:

Brittle systems:
- Work fine under normal conditions
- Fail when stressed (unusual weather, equipment failures, multiple simultaneous issues, key personnel absent)
- Catastrophic failure modes (small perturbations cause large failures)
Systems that work at 3 AM are resilient:

Resilient systems:
- Work under normal conditions AND abnormal conditions
- Degrade gracefully under stress (performance decreases but doesn’t collapse)
- Recover quickly from disturbances
- Learn from stress events and become stronger
Real example: Testing resilience through simulation. A mine in Australia conducted “stress test simulations”:

Simulation 1: “Everything Breaks Friday Night”
Scenario: The primary monitoring system fails at 10 PM Friday. The backup system shows a concerning piezometer trend. A storm is forecast overnight. The RTFE is on vacation. The superintendent is at a wedding (unreachable). The EOR is not answering the phone.

Question: What happens?
Pre-2020 reality: It would have been chaos, probably a delayed response until Monday, possibly dangerous.

Post-2020 system:
- Backup monitoring continues
- Backup RTFE receives automated alert
- Operations supervisor has authority to implement precautionary measures (reduce deposition, prepare for emergency)
- On-call engineer (designated weekly) provides technical support
- EOR has a backup contact (a partner in the firm who knows the facility)
- System degraded but didn’t fail
Simulation 2: “Cascade Scenario”
Heavy rain → Rising piezometers → TARP Yellow triggered → Increased monitoring reveals instrument failure → Need to assess with reduced instrumentation → Meanwhile, beach slope steepening detected → Multiple issues simultaneously
Question: Can the system handle multiple concurrent issues?

Result: They identified that their system could handle 2 simultaneous issues but would struggle with 3+. They created a surge capacity plan: pre-identified contractors on standby, emergency budget authority, rapid mobilization protocols.

The value of stress testing: It reveals system weaknesses before real emergencies occur.

The Human Element: Systems Must Account for How Humans Actually Behave

Here’s an uncomfortable truth: Even perfect procedures fail if they don’t account for human psychology.

Human Factor 1: Decision Fatigue

Reality: By 3 AM on a 12-hour shift, people’s decision-making capability is impaired.

System design response:
- Minimize decisions required during off-hours
- Automate what can be automated
- Provide clear guidance for required decisions (one such branch is sketched below)
- Escalate to fresh personnel when possible
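As an illustration of “provide clear guidance for required decisions,” the entire off-hours branch can be scripted. A minimal sketch, where the protocol name and the backup-instrument check are assumptions; the example that follows walks through the same logic in prose.

```python
def threshold_exceeded_flow(other_zone_readings_elevated: bool,
                            backup_instrument_available: bool) -> str:
    """Reduce the off-hours decision to one yes/no verification.
    Notifications have already been sent automatically before this point."""
    if other_zone_readings_elevated:
        return "Implement Protocol A (displayed on screen)"
    if backup_instrument_available:
        return "Possible instrument issue: switch to backup, document in log"
    return "Possible instrument issue: document in log, request field verification"

print(threshold_exceeded_flow(False, True))
# Possible instrument issue: switch to backup, document in log
```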
Example:

Bad system: “If a piezometer reading is elevated, determine the appropriate response based on conditions, rainfall history, other instrument readings, and engineering judgment.”

At 3 AM, after 9 hours on shift, that’s too much cognitive load. The person will probably default to “wait until day shift.”

Good system: “If a piezometer exceeds its threshold: (1) Automated alert sent. (2) You: Verify the reading by checking other instruments in the same zone. (3) System displays: Are other readings also elevated? Yes/No. (4) If Yes: Implement Protocol A (displayed on screen). If No: Possible instrument issue; switch to backup if available and document in the log.”

The decision is reduced to simple verification and following a displayed protocol.

Human Factor 2: Normalization of Deviance

Reality: When nothing bad happens despite minor deviations from procedures, people gradually accept larger deviations as normal.

Example progression:
- Week 1: The TARP says notify within 2 hours. We notified within 3 hours. Nothing bad happened.
- Week 4: Notified within 6 hours. Still fine.
- Week 12: Waiting until the next shift to notify. Becomes the norm.
- Week 30: A major event occurs. Response is delayed because actual practice had drifted far from the procedure.
System design response: Audit trail and performance monitoring:
- System tracks ACTUAL response times, not just whether a response occurred (a drift-detection sketch follows this list)
- Automated reports flag deviations (a monthly report to the Accountable Executive shows: 2-hour notification target achieved 73% of the time, average actual 4.1 hours)
- Deviations trigger review: Why are we not meeting targets? Are targets unrealistic? Or are procedures not being followed?
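One way to make that drift visible is to compare a recent rolling average of actual notification times against both the target and an earlier window. A sketch, where the 4-week window is an assumed choice:

```python
def detect_drift(weekly_hours: list, target: float = 2.0, window: int = 4) -> bool:
    """Flag if the recent rolling average exceeds the target AND is worse
    than the preceding window (i.e., practice is drifting, not just noisy)."""
    recent = weekly_hours[-window:]
    earlier = weekly_hours[-2 * window:-window]
    recent_avg = sum(recent) / len(recent)
    worsening = bool(earlier) and recent_avg > sum(earlier) / len(earlier)
    return recent_avg > target and worsening

# Mirrors the progression in the example: 3 h -> 6 h -> next-shift reporting.
print(detect_drift([2.1, 2.5, 3.0, 3.2, 4.0, 5.5, 6.0, 7.5]))  # True
```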
Either way, visibility prevents drift.

Real example from a mine that caught normalization early:

A monthly TARP audit revealed: Orange triggers should stop deposition immediately pending assessment. Actual practice: deposition continued while the assessment occurred (average 4 hours). This happened 5 times in 6 months.

Investigation revealed: Operations felt stopping immediately was “overreacting,” because assessments always concluded deposition could resume. So they kept going while waiting for the assessment.

Response:
- Short term: A reminder that the protocol must be followed as written
- Long term: The protocol was updated; an Orange trigger allows continued deposition for up to 2 hours IF the assessment is underway AND the RTFE explicitly authorizes it. After 2 hours, deposition must stop until the assessment is complete.
Result: The protocol was modified to match reality while maintaining safety. But the original protocol was enforced until the formal change occurred, preventing normalization from continuing unchecked.

Human Factor 3: Bystander Effect and Diffusion of Responsibility

Reality: When multiple people are present (or could be involved), individuals are less likely to take action, assuming someone else will.

Example: Multiple shift supervisors see a concerning monitoring trend. Each assumes another supervisor will report it. Nobody reports it.

System design response:

Clear individual assignment:
- Not “someone should check piezometers”
- But “Charlie: You’re assigned the piezometer check today, results due by 14:00, the system will alert if not completed”
Forcing acknowledgment:
- Critical alerts require acknowledgment: “Click here to confirm YOU have received this alert and are taking responsibility for the response”
- No ambiguity about who’s responsible
Escalation for inaction:
- If an alert is not acknowledged within the timeframe, it escalates to the supervisor (sketched below)
- Creates accountability
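Acknowledgment-with-escalation fits in a few lines. A sketch; the 2-hour window echoes the earlier example, and the chain structure is an assumption:

```python
from datetime import datetime, timedelta, timezone

def escalation_target(alert_sent: datetime, acknowledged: bool,
                      chain: list, window_hours: float = 2.0) -> str:
    """Return who currently owns the alert: the named owner, or the next
    person in the chain once the acknowledgment window has lapsed."""
    if acknowledged:
        return chain[0]  # named owner confirmed responsibility
    overdue = datetime.now(timezone.utc) - alert_sent > timedelta(hours=window_hours)
    return chain[1] if overdue and len(chain) > 1 else chain[0]

sent = datetime.now(timezone.utc) - timedelta(hours=3)
print(escalation_target(sent, acknowledged=False,
                        chain=["charlie (assigned)", "shift supervisor"]))
# shift supervisor
```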
Human Factor 4: Hindsight Bias (The “Should Have Known” Problem)

Reality: After an incident, it’s easy to say “they should have known this was serious.” But in the moment, without hindsight, signals are ambiguous.

System design response: Design for ambiguity:
- Accept that people will face ambiguous situations
- Err on the side of caution
- Make “escalating when uncertain” the encouraged behavior
- Don’t punish people for “false alarms”
Real example of handling ambiguity well: A mine in Peru, 2 AM. The night supervisor observes:
- Piezometer P-7 reading increased 8% over 6 hours
- Still well below the Yellow threshold
- Heavy rain earlier (stopped 3 hours ago)
- The reading might just be a rainfall response
- Or might be the beginning of a trend
What he did: Called the backup RTFE (the designated on-call person) and explained the situation: “Probably nothing, but wanted you aware. I’m watching it.”

Backup RTFE response: “Good call. Keep monitoring every hour. If it continues rising or if any other instruments show changes, call me back immediately. If it stabilizes, we’ll review in the morning.”

Outcome: The reading stabilized within 2 hours. It was a rainfall response.

But here’s the important part: The next week, at the operations meeting:
- The supervisor mentioned the call
- Some colleagues: “You woke up the RTFE for nothing?”
- The RTFE (who was there): “No, that was exactly right. I’d rather get ten calls about trends that turn out to be nothing than miss one that’s actually important. You made the right judgment call.”
The culture message: Erring on the side of caution is encouraged, not mocked. That’s a system that works.

Building Systems, Not Just Procedures: A Practical Framework

If you’re convinced your TMS needs to be a system rather than documents, but don’t know how to get there, here is a practical path.

Step 1: Map Current State (Where Systems Break Down)

Exercise for your team: Present scenarios and ask, “What would actually happen?”

Scenario examples:
- “3 AM Saturday. Automated alert shows a piezometer Yellow threshold. What happens in the next 30 minutes?”
- “Day shift. Multiple instruments showing unusual readings (not TARP triggers, but weird). Who investigates? How long does it take?”
- “RTFE on vacation, superintendent at a conference. Orange TARP triggered. Who makes decisions?”
Map the actual process:
- Who gets notified (really, not theoretically)?
- How long does it take (actually, not per procedure)?
- What decisions are made, and by whom?
- Where does the process break down?
- What information isn’t available when needed?
Identify gaps:
- Ambiguous responsibilities
- Missing information flows
- Technology that doesn’t support the workflow
- Procedures nobody actually follows
- Single points of failure
Step 2: Design System Components (Technology + People + Processes)

For each key function, design an integrated system.

Example: TARP Response System

Components needed:

Technology:
- Automated threshold detection
- Multi-channel notification system
- Decision support tools (what data should I look at?)
- Documentation tools (capture what happened)
- Escalation timers (alerts if no response)
People:
- Clear role assignments
- Primary and backup personnel
- Training on system use
- Authority levels defined
Processes:
- TARP thresholds and responses
- Communication protocols
- Decision-making frameworks
- After-action review process
Integration:
- Technology supports people executing processes
- Processes leverage technology capabilities
- People understand why the system is designed this way
Step 3: Build in Feedback and Improvement

Every system component needs a feedback mechanism. Examples:

TARP system:
- Track every trigger: response time, actions taken, outcome, whether effective
- Monthly analysis: Are thresholds appropriate? Are responses working? Are people following protocols?
- Quarterly review: System improvements based on the analysis
Training system:
- Track training completion
- Test knowledge retention
- Measure performance of trained vs untrained personnel
- Update training based on actual mistakes made in the field
Monitoring system:
- Track instrument reliability (a scoring sketch follows this list)
- Measure data quality
- Identify instruments that frequently fail or give questionable readings
- Optimize the monitoring network based on the value of different instruments
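A sketch of instrument-reliability scoring from that list, with the record fields assumed for illustration:

```python
def reliability_report(instrument_log: dict) -> list:
    """Return instruments sorted worst-first by problem rate, so the
    monitoring network can be tuned around its weakest sensors."""
    scores = []
    for tag, rec in instrument_log.items():
        problems = rec["failures"] + rec["questionable_readings"]
        scores.append((tag, round(problems / rec["total_readings"], 3)))
    return sorted(scores, key=lambda s: s[1], reverse=True)

log = {"P-7":  {"failures": 1, "questionable_readings": 4,  "total_readings": 365},
       "P-14": {"failures": 6, "questionable_readings": 20, "total_readings": 365}}
print(reliability_report(log))
# [('P-14', 0.071), ('P-7', 0.014)]
```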
Step 4: Test Under Realistic Conditions

Don’t just assume the system will work; test it.

Types of testing:

- Functional testing: Does each component work? (Technical testing of software, instruments, communications)
- Integration testing: Do components work together? (Full workflow tests)
- Stress testing: Does the system work under abnormal conditions? (Simulations of multiple failures, off-hours scenarios, key personnel absent)
- User testing: Can actual personnel use the system effectively? (Observe real users, identify where they struggle)

Real example of testing revealing problems: A mine tested their emergency response system:
- Simulated an Orange TARP at 2 AM on a Friday
- Tested whether the notification system worked, whether personnel responded appropriately, and whether information was available
What they discovered:
- Technology worked: Alerts were sent, received, and acknowledged
- Information gap: Personnel needed to review recent weather and operations data to assess the situation. That data wasn’t easily accessible at 2 AM (it required logging into multiple systems).
- Decision support gap: The RTFE received the alert but no guidance on what information to review or who to consult
Result: The system redesign included an integrated dashboard, accessible via phone, showing recent weather, operations, and all relevant monitoring data in one place. The test revealed that the system worked technically but didn’t effectively support decision-making.

Step 5: Embed Continuous Improvement

Systems must evolve. What this looks like:

Quarterly system review:
- What incidents occurred? How did the system perform?
- What near-misses happened?
- What external learnings are relevant? (incidents at other facilities, new technologies, regulatory changes)
- What improvements should we make?
Annual system audit:
- Is the system still fit for purpose?
- Have risks changed, requiring system updates?
- Are technologies obsolete?
- Do personnel have the skills and knowledge needed?
- Are procedures still followed and effective?
Continuous learning capture:
- After-action reviews for every significant event
- Learning documented and shared
- System updated based on learnings
- Trends analyzed for systemic issues
Real example from a mine with mature continuous improvement: They maintain a “System Evolution Log”:
- Every system change documented with its rationale
- Links to the incidents or learnings that drove the change
- Effectiveness of changes tracked over time
Example entry:
- Date: March 2023
- Change: Added redundant cellular and satellite communication for the monitoring system
- Rationale: November 2022 incident, when the cellular network was down during a storm and monitoring data was not available remotely
- Effectiveness tracking: Communication uptime 99.97% since the change (vs 97.3% before). No communication gaps during 8 significant weather events since implementation.
The log shows a system continuously learning and improving.

The Compliance System’s Role: Enabling Systems Thinking

Your GISTM compliance platform should support systems, not just document requirements.

What Systems-Thinking Compliance Platforms Enable:
- Workflow Integration
  - Connect monitoring data → TARP assessment → Response actions → Documentation → Review
  - Automated routing of information to the right people at the right time
  - Capture complete system activity, not just endpoints
- Performance Analytics
  - Track system performance (response times, effectiveness, compliance rates)
  - Identify trends and patterns
  - Flag degradation before failure
- Feedback Loop Support
  - After-action review templates and tracking
  - Learning capture and sharing
  - System improvement tracking
- Scenario Testing
  - Simulate events and test system response
  - Document exercises and learnings
  - Track preparedness over time
- Integration with Operational Systems
  - Connect to monitoring systems (data flows automatically)
  - Connect to operations (deposition records, maintenance, weather)
  - Single view of facility status enabling better decisions
- Mobile Access
  - System accessible from anywhere (critical for 3 AM scenarios)
  - Works on phones, tablets, laptops
  - Offline capability for remote locations
The goal: The compliance system becomes the infrastructure that enables your TMS to function as an integrated system, not a documentation repository.

The Question Only You Can Answer

As Accountable Executive, imagine this: It’s 3 AM. You’re asleep. Unusual conditions develop at your facility, not catastrophic, but requiring judgment calls and coordinated response. Your night shift supervisor is facing this situation without you.

Question: Do you trust your system?

Not: Do you trust that individual? (Personnel change.)

But: Do you trust that your TMS will:
- Detect the situation
- Notify the right people
- Provide the information needed
- Support appropriate decisions
- Document what happens
- Escalate if needed
- Keep people safe
If your honest answer is “I hope so” or “probably” or “I’m not sure,” you have work to do. Because GISTM doesn’t just require that you have a TMS. It requires that your TMS actually works: all day, every day, including at 3 AM when nobody’s watching.

That’s the test that matters. Is your TMS passing it?
Does your GISTM compliance system enable an integrated TMS, or just document isolated procedures? [Discover platforms that support systems thinking and resilient operations]