Monitoring strategy guide
Effective monitoring and logging strategies help you maintain system stability and catch failures before they impact your services.
Core monitoring metrics
Alpacon collects and visualizes key server performance metrics in real-time.
Basic metrics
The following metrics are collected by default for all servers:
- CPU usage: Overall CPU utilization monitoring
- Memory usage: System memory usage and available space tracking
- Disk usage: Partition-level capacity and utilization verification
Advanced metrics (paid plans)
Additional metrics for detailed performance analysis:
- Disk I/O: Peak/AVG performance and per-disk I/O analysis
- Network traffic: Peak/AVG bandwidth, bps/pps, per-interface traffic analysis
For more details, see Server detail - Monitoring tab.
Alert rule configuration
Proper alert rule configuration is essential for effective monitoring.
Rule configuration strategy
1. Threshold setting
- CPU usage: Alert when sustained above 80%
- Memory usage: Alert when reaching 90%
- Disk usage: Alert when reaching 85%
2. Alert priority classification
- Critical: Situations requiring immediate action (e.g., disk >95%)
- Warning: Situations requiring monitoring (e.g., CPU >80%)
- Info: Informational notifications only
3. Group-based rule application
- Production servers: Apply stricter thresholds
- Development/test servers: Apply flexible thresholds
- Database servers: Focus on I/O and memory monitoring
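The priority scheme above can be sketched as a small classification function. This is a minimal illustration, not Alpacon's implementation: the Warning thresholds are the example values from this guide, and the Critical cutoffs for CPU and memory are assumptions (only disk >95% is stated above).

```shell
#!/bin/sh
# Sketch: classify a metric reading into the guide's priority levels.
# Warning thresholds follow the guide (CPU 80, memory 90, disk 85);
# Critical cutoffs are assumed example values, not Alpacon defaults.
classify() {
  metric=$1
  value=$2
  case $metric in
    cpu)    warn=80 crit=95 ;;
    memory) warn=90 crit=95 ;;
    disk)   warn=85 crit=95 ;;
    *)      echo "unknown"; return ;;
  esac
  if [ "$value" -ge "$crit" ]; then
    echo "Critical"
  elif [ "$value" -ge "$warn" ]; then
    echo "Warning"
  else
    echo "Info"
  fi
}

classify cpu 85     # Warning
classify disk 96    # Critical
classify memory 70  # Info
```

In practice you would configure these levels in the alert rule itself; the point is that each metric carries its own Warning/Critical pair rather than one global threshold.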
Rule assignment method
You can assign alert rules predefined in Workspace settings to individual servers.
- Access the Alert Rules tab on the server detail screen
- Select rules to apply
- Configure notification recipients
For more details, see Server detail - Alert Rules tab.
Log management
Logs are critical resources for failure analysis and system auditing.
Recent logs
Review logs generated by the Alpamon agent to identify:
- Agent status changes
- Connection errors and retry history
- Data collection failure causes
Recent backhauls
Monitor the transmission status of data collected from servers:
- Backhaul job success/failure status
- Transmission time and data size
- Transmission delays due to network issues
For more details, see Server detail - Logs tab.
Monitoring usage scenarios
1. Performance bottleneck detection
Comprehensively analyze CPU, memory, and disk I/O metrics to identify performance bottlenecks.
Examples:
- Normal CPU but high disk I/O → Disk performance improvement needed
- Continuously increasing memory usage → Potential memory leak
2. Capacity planning
Analyze disk usage trends to predict capacity expansion timing.
Recommendations:
- Analyze monthly usage growth trends
- Begin capacity expansion review when disk reaches 85%
- Proactively secure capacity based on projected growth rate
3. Incident response
Establish rapid response processes when receiving alert notifications.
Basic response procedure:
- Receive alert notification
- Check current metrics on server detail screen
- Analyze recent events in Logs tab
- Access immediately via Websh if necessary
Incident response scenarios
Real-world response methods for major failure situations.
Scenario 1: CPU overload
Symptoms:
- CPU usage sustained above 95%
- Service response slowdown
- Repeated critical alerts
Immediate response (within 5 minutes):
1. Server detail → Monitoring tab
→ Check CPU usage trends
→ Identify spike timing
2. Activity tab → Command History
→ Check recently executed commands
→ Verify deployment or configuration changes
3. Access immediately via Websh
→ Run top or htop
→ Identify high CPU processes
4. Temporary measures
→ Terminate unnecessary processes (kill)
→ Load balancing or scaling
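Steps 3 and 4 above can be carried out with a few standard commands from a Websh session. A minimal sketch (`<PID>` is a placeholder for the process ID you identify):

```shell
# Top 10 processes by CPU usage, plus the header line
# (works even where top/htop is unavailable).
ps aux --sort=-%cpu | head -11

# Compare load average against core count: a 1-minute load
# persistently above the core count suggests CPU saturation.
nproc
uptime

# Last resort, as noted in step 4: terminate a runaway process.
# kill <PID>       # graceful (SIGTERM)
# kill -9 <PID>    # forced (SIGKILL), only if SIGTERM is ignored
```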
Root cause analysis (within 30 minutes):
5. Logs tab → Recent Logs
→ Check error messages
→ Analyze application logs
6. Identify and address root cause
- Infinite loop bug → Deploy hotfix
- Traffic surge → Caching, scale out
- Resource leak → Schedule restart
Scenario 2: Memory shortage
Symptoms:
- Memory usage above 90%
- OOM (Out Of Memory) errors
- Sudden service termination
Immediate response:
1. Server detail → Monitoring tab
→ Check memory usage trends
→ Identify spike or sustained increase patterns
2. Websh access
→ free -h (memory status)
→ ps aux --sort=-%mem | head -20
→ Check top 20 memory-consuming processes
3. Free temporary memory
→ Stop unnecessary services
→ Clear cache (drop_caches)
→ Verify swap activation
4. Long-term solutions
- Memory leak → Code fix and deployment
- Normal increase → Server upgrade
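Steps 2 and 3 above, sketched as Websh commands. Dropping caches requires root and is safe in the sense that the kernel rebuilds the cache on demand, but it only reclaims cache, not leaked memory:

```shell
# Overall memory and swap status.
free -h

# Top 20 memory consumers (21 lines including the header).
ps aux --sort=-%mem | head -21

# Reclaim page cache (root only); sync first so dirty pages are flushed.
# sync && echo 3 > /proc/sys/vm/drop_caches

# Verify whether swap is configured and in use.
swapon --show 2>/dev/null || cat /proc/swaps
```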
Scenario 3: Disk space shortage
Symptoms:
- Disk usage above 90%
- “No space left on device” errors
- Log recording failures
Immediate response:
1. Server detail → Monitoring tab
→ Check disk usage
→ Identify per-partition usage status
2. Websh access
→ df -h (per-partition usage)
→ du -sh /* | sort -hr | head -10
→ Identify directories consuming the most space
3. Emergency space recovery
→ Delete old log files
→ Clean temporary files (/tmp, /var/tmp)
→ Remove unnecessary packages
→ Compress after backup
4. Configure log rotation
→ Check logrotate settings
→ Adjust log retention period
→ Set up automatic cleanup scripts
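The recovery and rotation steps above can be sketched as follows. The application log path and the logrotate entry are hypothetical examples; review every path before deleting anything, and prefer truncating a live log over deleting it so the writing process keeps a valid file descriptor:

```shell
# Step 2: locate what is consuming space.
df -h
du -sh /var/log/* 2>/dev/null | sort -hr | head -10

# Step 3: list rotated/compressed logs older than 30 days
# (add -delete only after verifying the list).
find /var/log -name '*.gz' -mtime +30 -print 2>/dev/null | head -20

# Truncate (do not delete) a huge live log — hypothetical path:
# : > /var/log/myapp/app.log

# Step 4: example /etc/logrotate.d entry (assumed app path),
# keeping 14 daily rotations, compressed:
# /var/log/myapp/*.log {
#     daily
#     rotate 14
#     compress
#     missingok
#     notifempty
# }
```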
Scenario 4: Network traffic surge
Symptoms (paid plans):
- Network traffic 300%+ above normal
- Bandwidth limit reached
- Response delays occurring
Immediate response:
1. Server detail → Monitoring tab
→ Check network traffic trends
→ Compare Peak vs AVG
→ Analyze per-interface traffic
2. Identify traffic causes
→ Analyze access logs
→ Check for DDoS attacks
→ Distinguish a legitimate traffic surge from abnormal traffic
3. Response measures
Normal traffic surge:
→ Activate CDN
→ Strengthen caching
→ Add load balancers
Abnormal traffic (DDoS):
→ Add firewall rules
→ Block IPs
→ Activate DDoS defense services
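The access-log analysis in step 2 often comes down to counting requests per source IP. A minimal sketch, assuming a common/combined log format where the client IP is the first field; a small inline sample stands in for a real log such as /var/log/nginx/access.log (path is an example):

```shell
# Inline sample access log (combined format, client IP in field 1).
cat > /tmp/access.sample <<'EOF'
203.0.113.50 - - [01/Jan/2025:12:00:00 +0000] "GET / HTTP/1.1" 200 512
203.0.113.50 - - [01/Jan/2025:12:00:01 +0000] "GET / HTTP/1.1" 200 512
198.51.100.7 - - [01/Jan/2025:12:00:02 +0000] "GET /app HTTP/1.1" 200 2048
203.0.113.50 - - [01/Jan/2025:12:00:03 +0000] "GET / HTTP/1.1" 200 512
EOF

# Top source IPs by request count (top offender first).
awk '{print $1}' /tmp/access.sample | sort | uniq -c | sort -rn | head -10
# top offender here: 3 requests from 203.0.113.50

# Block a confirmed abusive source (root required; example IP):
# iptables -A INPUT -s 203.0.113.50 -j DROP
```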
Scenario 5: Disk I/O bottleneck
Symptoms (paid plans):
- High disk I/O wait
- Query response slowdown
- Application timeouts
Immediate response:
1. Server detail → Monitoring tab
→ Check disk I/O Peak/AVG
→ Analyze per-disk I/O
2. Websh access
→ iostat -x 1 10
→ Check I/O wait rate (%iowait)
→ iotop (identify high I/O processes)
3. Immediate measures
→ Pause I/O-intensive tasks
→ Reschedule backup/batch jobs
→ Minimize unnecessary file access
4. Long-term solutions
- Index optimization (DB)
- Upgrade to SSD
- Add read replica (DB)
- Introduce caching layer
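Interpreting step 2 above: in `iostat -x` output, a high %iowait on the CPU line combined with a device whose %util stays near 100 identifies the saturated disk. A sketch, with a /proc fallback for hosts where sysstat is not installed (`<PID>` is a placeholder):

```shell
# iostat -x 1 10          # 10 one-second samples; watch %iowait and %util

# Raw per-device counters when sysstat is absent.
head -5 /proc/diskstats

# Step 3 alternative: deprioritize an I/O-heavy job instead of killing it
# (idle I/O scheduling class, so it only runs when the disk is otherwise free):
# ionice -c3 -p <PID>
```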
Proactive prevention strategies
Baseline establishment
Record metric patterns during normal operations to quickly detect anomalies.
Baseline recording items:
Daily patterns:
- Business hours (9-18): Average CPU 30-40%
- Night time (0-6): Average CPU 10-15%
- Lunch time (12-13): Traffic decrease
Weekly patterns:
- Monday morning: Traffic increase (weekly batch)
- Wednesday evening: Service deployment
- Weekends: Low load
Monthly patterns:
- Start of month: CPU/DB load increase due to billing batch
- Disk usage: 5GB increase per month (logs)
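One simple way to build such a baseline outside of Alpacon's own charts is to append periodic samples to a CSV and compare against it later. A minimal sketch (the cron schedule and output path are examples):

```shell
# One baseline sample: ISO timestamp, 1-minute load average, used memory (MB).
# Run from cron (e.g. */5 * * * *) appending to a CSV such as
# /var/log/baseline.csv for later comparison.
sample="$(date -Is),$(awk '{print $1}' /proc/loadavg),$(free -m | awk '/^Mem/ {print $3}')"
echo "$sample"
```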
Capacity planning
Analyze metric trends to secure resources proactively.
Planning method:
1. Collect past 3 months of data
→ Server detail → Monitoring tab
→ Record monthly Peak and AVG
2. Calculate growth rate
→ Disk: Average 10GB increase per month
→ Memory: 5% increase per month
→ Traffic: 15% increase per month
3. Predict threshold arrival time
→ Current disk 60% → 90% in 6 months
→ Plan capacity expansion 3 months ahead
4. Secure budget and execute
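The prediction in step 3 is simple linear extrapolation: remaining headroom divided by monthly growth. A sketch using the figures from the example above (60% today, growing 5 percentage points per month, 90% threshold):

```shell
# months_to_threshold CURRENT_PCT GROWTH_PCT_PER_MONTH THRESHOLD_PCT
# Linear extrapolation: (threshold - current) / growth.
months_to_threshold() {
  awk -v cur="$1" -v grow="$2" -v thr="$3" \
      'BEGIN { if (grow <= 0) { print "n/a"; exit }
               m = (thr - cur) / grow
               printf "%.1f\n", (m < 0 ? 0 : m) }'
}

months_to_threshold 60 5 90   # → 6.0 (months until 90%)
```

Subtracting the lead time from step 3 (plan 3 months ahead) tells you when the expansion review must start.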
Regular inspection
Daily inspection:
- Check and respond to critical alerts
- Visually verify key server metrics
- Check for sudden changes compared to previous day
Weekly inspection:
- Analyze metric trends for all servers
- Review unresolved warning alerts
- Analyze log error patterns
- Verify disk usage growth trends
Monthly inspection:
- Readjust alert rule thresholds
- Update baselines
- Review capacity planning progress
- Analyze and improve incident response times
Recommendations
Regular reviews
- Daily: Check critical alerts and abnormal metrics
- Weekly: Analyze per-server metric trends
- Monthly: Review and adjust alert rule appropriateness
Efficient alert management
- Minimize unnecessary alerts to prevent alert fatigue
- Clearly distinguish between critical and informational alerts
- Differentiate notification channels based on team roles
Documentation
- Record normal operation baselines for each server
- Document past incident cases and response methods
- Create response manuals for alert occurrences
Related documentation
- Server monitoring: Detailed server monitoring features
- Workspace settings: Alert rule configuration
- Groups: Group-based server monitoring