Monitoring strategy guide
Effective monitoring and logging strategies help you maintain system stability and catch failures before they impact your services.
Core monitoring metrics
Alpacon collects and visualizes key server performance metrics in real-time.
Basic metrics
The following metrics are collected by default for all servers:
- CPU usage: Overall CPU utilization monitoring
- Memory usage: System memory usage and available space tracking
- Disk usage: Partition-level capacity and utilization verification
Advanced metrics (paid plans)
Additional metrics for detailed performance analysis:
- Disk I/O: Peak/AVG performance and per-disk I/O analysis
- Network traffic: Peak/AVG bandwidth, bps/pps, per-interface traffic analysis
For more details, see Server detail - Monitoring tab.
Alert rule configuration
Proper alert rule configuration is essential for effective monitoring.
Rule configuration strategy
1. Threshold setting
- CPU usage: Alert when sustained above 80%
- Memory usage: Alert when reaching 90%
- Disk usage: Alert when reaching 85%
2. Alert priority classification
- Critical: Situations requiring immediate action (e.g., disk >95%)
- Warning: Situations requiring monitoring (e.g., CPU >80%)
- Info: Informational notifications only
3. Group-based rule application
- Production servers: Apply stricter thresholds
- Development/test servers: Apply flexible thresholds
- Database servers: Focus on I/O and memory monitoring
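The priority scheme above can be sketched as a small classification function. This is a minimal illustration, not Alpacon's implementation: the Warning thresholds are the example values from this guide, and the Critical cutoffs for CPU and memory are assumptions (only disk >95% is stated above).

```shell
#!/bin/sh
# Sketch: classify a metric reading into the guide's priority levels.
# Warning thresholds follow the guide (CPU 80, memory 90, disk 85);
# Critical cutoffs are assumed example values, not Alpacon defaults.
classify() {
  metric=$1
  value=$2
  case $metric in
    cpu)    warn=80 crit=95 ;;
    memory) warn=90 crit=95 ;;
    disk)   warn=85 crit=95 ;;
    *)      echo "unknown"; return ;;
  esac
  if [ "$value" -ge "$crit" ]; then
    echo "Critical"
  elif [ "$value" -ge "$warn" ]; then
    echo "Warning"
  else
    echo "Info"
  fi
}

classify cpu 85     # Warning
classify disk 96    # Critical
classify memory 70  # Info
```

In practice you would configure these levels in the alert rule itself; the point is that each metric carries its own Warning/Critical pair rather than one global threshold.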
Rule assignment method
You can assign alert rules predefined in Workspace settings to individual servers.
- Access the Alert Rules tab on the server detail screen
- Select rules to apply
- Configure notification recipients
For more details, see Server detail - Alert Rules tab.
Log management
Logs are critical resources for failure analysis and system auditing.
Recent logs
Review logs generated by the Alpamon agent to identify:
- Agent status changes
- Connection errors and retry history
- Data collection failure causes
Recent backhauls
Monitor the transmission status of data collected from servers:
- Backhaul job success/failure status
- Transmission time and data size
- Transmission delays due to network issues
For more details, see Server detail - Logs tab.
Monitoring usage scenarios
1. Performance bottleneck detection
Comprehensively analyze CPU, memory, and disk I/O metrics to identify performance bottlenecks.
Examples:
- Normal CPU but high disk I/O → Disk performance improvement needed
- Continuously increasing memory usage → Potential memory leak
2. Capacity planning
Analyze disk usage trends to predict capacity expansion timing.
Recommendations:
- Analyze monthly usage growth trends
- Begin capacity expansion review when disk reaches 85%
- Proactively secure capacity based on projected growth rate
3. Incident response
Establish rapid response processes when receiving alert notifications.
Basic response procedure:
- Receive alert notification
- Check current metrics on server detail screen
- Analyze recent events in Logs tab
- Access immediately via Websh if necessary
Incident response scenarios
Real-world response methods for major failure situations.
Scenario 1: CPU overload
Symptoms:
- CPU usage sustained above 95%
- Service response slowdown
- Repeated critical alerts
Immediate response (within 5 minutes):
1. Server detail → Monitoring tab
→ Check CPU usage trends
→ Identify spike timing
2. Activity tab → Command History
→ Check recently executed commands
→ Verify deployment or configuration changes
3. Access immediately via Websh
→ Run top or htop
→ Identify high CPU processes
4. Temporary measures
→ Terminate unnecessary processes (kill)
→ Load balancing or scaling
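Steps 3 and 4 above can be carried out with a few standard commands from a Websh session. A minimal sketch (`<PID>` is a placeholder for the process ID you identify):

```shell
# Top 10 processes by CPU usage, plus the header line
# (works even where top/htop is unavailable).
ps aux --sort=-%cpu | head -11

# Compare load average against core count: a 1-minute load
# persistently above the core count suggests CPU saturation.
nproc
uptime

# Last resort, as noted in step 4: terminate a runaway process.
# kill <PID>       # graceful (SIGTERM)
# kill -9 <PID>    # forced (SIGKILL), only if SIGTERM is ignored
```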
Root cause analysis (within 30 minutes):
5. Logs tab → Recent Logs
→ Check error messages
→ Analyze application logs
6. Identify and address root cause
- Infinite loop bug → Deploy hotfix
- Traffic surge → Caching, scale out
- Resource leak → Schedule restart
Scenario 2: Memory shortage
Symptoms:
- Memory usage above 90%
- OOM (Out Of Memory) errors
- Sudden service termination
Immediate response:
1. Server detail → Monitoring tab
→ Check memory usage trends
→ Identify spike or sustained increase patterns
2. Websh access
→ free -h (memory status)
→ ps aux --sort=-%mem | head -20
→ Check top 20 memory-consuming processes
3. Free temporary memory
→ Stop unnecessary services
→ Clear cache (drop_caches)
→ Verify swap activation
4. Long-term solutions
- Memory leak → Code fix and deployment
- Normal increase → Server upgrade
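Steps 2 and 3 above, sketched as Websh commands. Dropping caches requires root and is safe in the sense that the kernel rebuilds the cache on demand, but it only reclaims cache, not leaked memory:

```shell
# Overall memory and swap status.
free -h

# Top 20 memory consumers (21 lines including the header).
ps aux --sort=-%mem | head -21

# Reclaim page cache (root only); sync first so dirty pages are flushed.
# sync && echo 3 > /proc/sys/vm/drop_caches

# Verify whether swap is configured and in use.
swapon --show 2>/dev/null || cat /proc/swaps
```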
Scenario 3: Disk space shortage
Symptoms:
- Disk usage above 90%
- “No space left on device” errors
- Log recording failures
Immediate response:
1. Server detail → Monitoring tab
→ Check disk usage
→ Identify per-partition usage status
2. Websh access
→ df -h (per-partition usage)
→ du -sh /* | sort -hr | head -10
→ Identify directories consuming the most space
3. Emergency space recovery
→ Delete old log files
→ Clean temporary files (/tmp, /var/tmp)
→ Remove unnecessary packages
→ Compress after backup
4. Configure log rotation
→ Check logrotate settings
→ Adjust log retention period
→ Set up automatic cleanup scripts
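The recovery and rotation steps above can be sketched as follows. The application log path and the logrotate entry are hypothetical examples; review every path before deleting anything, and prefer truncating a live log over deleting it so the writing process keeps a valid file descriptor:

```shell
# Step 2: locate what is consuming space.
df -h
du -sh /var/log/* 2>/dev/null | sort -hr | head -10

# Step 3: list rotated/compressed logs older than 30 days
# (add -delete only after verifying the list).
find /var/log -name '*.gz' -mtime +30 -print 2>/dev/null | head -20

# Truncate (do not delete) a huge live log — hypothetical path:
# : > /var/log/myapp/app.log

# Step 4: example /etc/logrotate.d entry (assumed app path),
# keeping 14 daily rotations, compressed:
# /var/log/myapp/*.log {
#     daily
#     rotate 14
#     compress
#     missingok
#     notifempty
# }
```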
Scenario 4: Network traffic surge
Symptoms (paid plans):
- Network traffic 300%+ above normal
- Bandwidth limit reached
- Response delays occurring
Immediate response:
1. Server detail → Monitoring tab
→ Check network traffic trends
→ Compare Peak vs AVG
→ Analyze per-interface traffic
2. Identify traffic causes
→ Analyze access logs
→ Check for DDoS attacks
→ Distinguish a legitimate traffic surge from abnormal traffic
3. Response measures
Normal traffic surge:
→ Activate CDN
→ Strengthen caching
→ Add load balancers
Abnormal traffic (DDoS):
→ Add firewall rules
→ Block IPs
→ Activate DDoS defense services
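The access-log analysis in step 2 often comes down to counting requests per source IP. A minimal sketch, assuming a common/combined log format where the client IP is the first field; a small inline sample stands in for a real log such as /var/log/nginx/access.log (path is an example):

```shell
# Inline sample access log (combined format, client IP in field 1).
cat > /tmp/access.sample <<'EOF'
203.0.113.50 - - [01/Jan/2025:12:00:00 +0000] "GET / HTTP/1.1" 200 512
203.0.113.50 - - [01/Jan/2025:12:00:01 +0000] "GET / HTTP/1.1" 200 512
198.51.100.7 - - [01/Jan/2025:12:00:02 +0000] "GET /app HTTP/1.1" 200 2048
203.0.113.50 - - [01/Jan/2025:12:00:03 +0000] "GET / HTTP/1.1" 200 512
EOF

# Top source IPs by request count (top offender first).
awk '{print $1}' /tmp/access.sample | sort | uniq -c | sort -rn | head -10
# top offender here: 3 requests from 203.0.113.50

# Block a confirmed abusive source (root required; example IP):
# iptables -A INPUT -s 203.0.113.50 -j DROP
```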
Scenario 5: Disk I/O bottleneck
Symptoms (paid plans):
- High disk I/O wait
- Query response slowdown
- Application timeouts
Immediate response:
1. Server detail → Monitoring tab
→ Check disk I/O Peak/AVG
→ Analyze per-disk I/O
2. Websh access
→ iostat -x 1 10
→ Check I/O wait rate (%iowait)
→ iotop (identify high I/O processes)
3. Immediate measures
→ Pause I/O-intensive tasks
→ Reschedule backup/batch jobs
→ Minimize unnecessary file access
4. Long-term solutions
- Index optimization (DB)
- Upgrade to SSD
- Add read replica (DB)
- Introduce caching layer
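Interpreting step 2 above: in `iostat -x` output, a high %iowait on the CPU line combined with a device whose %util stays near 100 identifies the saturated disk. A sketch, with a /proc fallback for hosts where sysstat is not installed (`<PID>` is a placeholder):

```shell
# iostat -x 1 10          # 10 one-second samples; watch %iowait and %util

# Raw per-device counters when sysstat is absent.
head -5 /proc/diskstats

# Step 3 alternative: deprioritize an I/O-heavy job instead of killing it
# (idle I/O scheduling class, so it only runs when the disk is otherwise free):
# ionice -c3 -p <PID>
```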
Proactive prevention strategies
Baseline establishment
Record metric patterns during normal operations to quickly detect anomalies.
Baseline recording items:
Daily patterns:
- Business hours (9-18): Average CPU 30-40%
- Night time (0-6): Average CPU 10-15%
- Lunch time (12-13): Traffic decrease
Weekly patterns:
- Monday morning: Traffic increase (weekly batch)
- Wednesday evening: Service deployment
- Weekends: Low load
Monthly patterns:
- Start of month: CPU/DB load increase due to billing batch
- Disk usage: 5GB increase per month (logs)
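One simple way to build such a baseline outside of Alpacon's own charts is to append periodic samples to a CSV and compare against it later. A minimal sketch (the cron schedule and output path are examples):

```shell
# One baseline sample: ISO timestamp, 1-minute load average, used memory (MB).
# Run from cron (e.g. */5 * * * *) appending to a CSV such as
# /var/log/baseline.csv for later comparison.
sample="$(date -Is),$(awk '{print $1}' /proc/loadavg),$(free -m | awk '/^Mem/ {print $3}')"
echo "$sample"
```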
Capacity planning
Analyze metric trends to secure resources proactively.
Planning method:
1. Collect past 3 months of data
→ Server detail → Monitoring tab
→ Record monthly Peak and AVG
2. Calculate growth rate
→ Disk: Average 10GB increase per month
→ Memory: 5% increase per month
→ Traffic: 15% increase per month
3. Predict threshold arrival time
→ Current disk 60% → 90% in 6 months
→ Plan capacity expansion 3 months ahead
4. Secure budget and execute
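The prediction in step 3 is simple linear extrapolation: remaining headroom divided by monthly growth. A sketch using the figures from the example above (60% today, growing 5 percentage points per month, 90% threshold):

```shell
# months_to_threshold CURRENT_PCT GROWTH_PCT_PER_MONTH THRESHOLD_PCT
# Linear extrapolation: (threshold - current) / growth.
months_to_threshold() {
  awk -v cur="$1" -v grow="$2" -v thr="$3" \
      'BEGIN { if (grow <= 0) { print "n/a"; exit }
               m = (thr - cur) / grow
               printf "%.1f\n", (m < 0 ? 0 : m) }'
}

months_to_threshold 60 5 90   # → 6.0 (months until 90%)
```

Subtracting the lead time from step 3 (plan 3 months ahead) tells you when the expansion review must start.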
Regular inspection
Daily inspection:
- Check and respond to critical alerts
- Visually verify key server metrics
- Check for sudden changes compared to previous day
Weekly inspection:
- Analyze metric trends for all servers
- Review unresolved warning alerts
- Analyze log error patterns
- Verify disk usage growth trends
Monthly inspection:
- Readjust alert rule thresholds
- Update baselines
- Review capacity planning progress
- Analyze and improve incident response times
Recommendations
Regular reviews
- Daily: Check critical alerts and abnormal metrics
- Weekly: Analyze per-server metric trends
- Monthly: Review and adjust alert rule appropriateness
Efficient alert management
- Minimize unnecessary alerts to prevent alert fatigue
- Clearly distinguish between critical and informational alerts
- Differentiate notification channels based on team roles
Documentation
- Record normal operation baselines for each server
- Document past incident cases and response methods
- Create response manuals for alert occurrences
Related documentation
- Server monitoring: Detailed server monitoring features
- Workspace settings: Alert rule configuration
- Groups: Group-based server monitoring