Monitoring strategy guide

Effective server monitoring and logging strategies help you maintain system stability and proactively prevent failures.


Core monitoring metrics

Alpacon collects and visualizes key server performance metrics in real-time.

Basic metrics

The following metrics are collected by default for all servers:

  • CPU usage: Overall CPU utilization monitoring
  • Memory usage: System memory usage and available space tracking
  • Disk usage: Partition-level capacity and utilization verification

Advanced metrics (paid plans)

Additional metrics for detailed performance analysis:

  • Disk I/O: Peak/AVG performance and per-disk I/O analysis
  • Network traffic: Peak/AVG bandwidth, bps/pps, per-interface traffic analysis

For more details, see Server detail - Monitoring tab.


Alert rule configuration

Proper alert rule configuration is essential for effective monitoring.

Rule configuration strategy

  1. Threshold setting

    • CPU usage: Alert when sustained above 80%
    • Memory usage: Alert when reaching 90%
    • Disk usage: Alert when reaching 85%
  2. Alert priority classification

    • Critical: Situations requiring immediate action (e.g., disk >95%)
    • Warning: Situations requiring monitoring (e.g., CPU >80%)
    • Info: Informational notifications only
  3. Group-based rule application

    • Production servers: Apply stricter thresholds
    • Development/test servers: Apply flexible thresholds
    • Database servers: Focus on I/O and memory monitoring
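
Independently of the alert rules configured in Alpacon, the thresholds above can be spot-checked directly on a host. The snippet below is a minimal sketch for a typical Linux server, not an Alpacon feature: the 80/90/85% values mirror the list above, and it only inspects the root partition.

    #!/bin/bash
    # Spot-check the thresholds above on a single Linux host (illustrative only).
    cpu_idle=$(vmstat 1 2 | tail -1 | awk '{print $15}')      # idle CPU % over ~1 s
    cpu_used=$((100 - cpu_idle))
    mem_used=$(free | awk '/Mem:/ {printf "%d", $3/$2*100}')  # used memory %
    disk_used=$(df --output=pcent / | tail -1 | tr -d ' %')   # root partition usage %

    [ "$cpu_used"  -ge 80 ] && echo "WARNING: CPU at ${cpu_used}%"
    [ "$mem_used"  -ge 90 ] && echo "WARNING: memory at ${mem_used}%"
    [ "$disk_used" -ge 85 ] && echo "WARNING: disk (/) at ${disk_used}%"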

Rule assignment method

You can assign alert rules that are predefined in the Workspace settings to individual servers.

  1. Access the Alert Rules tab on the server detail screen
  2. Select rules to apply
  3. Configure notification recipients

For more details, see Server detail - Alert Rules tab.


Log management

Logs are critical resources for failure analysis and system auditing.

Recent logs

Review logs generated by the Alpamon agent to identify:

  • Agent status changes
  • Connection errors and retry history
  • Data collection failure causes

Recent backhauls

Monitor the transmission status of data collected from servers:

  • Backhaul job success/failure status
  • Transmission time and data size
  • Transmission delays due to network issues

For more details, see Server detail - Logs tab.


Monitoring usage scenarios

1. Performance bottleneck detection

Comprehensively analyze CPU, memory, and disk I/O metrics to identify performance bottlenecks.

Examples:

  • Normal CPU but high disk I/O → Disk performance improvement needed
  • Continuously increasing memory usage → Potential memory leak
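
When the dashboard alone is not conclusive, a single vmstat run is a quick way to tell the cases above apart. As a rough rule of thumb: sustained high us/sy means CPU-bound work, a high wa column points at disk I/O, and steadily shrinking free memory with swap activity (si/so) suggests memory pressure.

    # Quick triage: 5 samples, 1 second apart.
    # us/sy = CPU in user/kernel, wa = CPU waiting on I/O, si/so = swap in/out.
    vmstat 1 5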

2. Capacity planning

Analyze disk usage trends to predict capacity expansion timing.

Recommendations:

  • Analyze monthly usage growth trends
  • Begin capacity expansion review when disk reaches 85%
  • Proactively secure capacity based on projected growth rate

3. Incident response

Establish a rapid response process for handling alert notifications.

Basic response procedure:

  1. Receive alert notification
  2. Check current metrics on server detail screen
  3. Analyze recent events in Logs tab
  4. Access immediately via Websh if necessary

Incident response scenarios

The following scenarios describe practical response methods for common failure situations.

Scenario 1: CPU overload

Symptoms:

  • CPU usage sustained above 95%
  • Service response slowdown
  • Repeated critical alerts

Immediate response (within 5 minutes):

1. Server detail → Monitoring tab
   → Check CPU usage trends
   → Identify spike timing

2. Activity tab → Command History
   → Check recently executed commands
   → Verify deployment or configuration changes

3. Access immediately via Websh
   → Run top or htop
   → Identify high CPU processes

4. Temporary measures
   → Terminate unnecessary processes (kill)
   → Load balancing or scaling
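
For step 3 above, the commands below are a minimal Websh sketch for finding the offending processes; <PID> is a placeholder, and kill -9 should stay a last resort because it gives the process no chance to clean up.

    # One-shot snapshot: top 10 processes by CPU
    ps aux --sort=-%cpu | head -10

    # Interactive view sorted by CPU (press q to quit)
    top -o %CPU

    # Only if a specific process is clearly the culprit and safe to stop
    kill <PID>        # graceful; escalate to kill -9 <PID> only as a last resort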

Root cause analysis (within 30 minutes):

5. Logs tab → Recent Logs
   → Check error messages
   → Analyze application logs

6. Identify and address root cause
   - Infinite loop bug → Deploy hotfix
   - Traffic surge → Caching, scale out
   - Resource leak → Schedule restart
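
For the log review in step 5, system-level errors can be pulled quickly with journalctl; the application log path below is only an example and depends on how your service writes logs.

    # System and service errors around the incident window
    journalctl --since "1 hour ago" -p err --no-pager | tail -50

    # Application log (example path; adjust for your service)
    grep -iE "error|exception|timeout" /var/log/myapp/app.log | tail -50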

Scenario 2: Memory shortage

Symptoms:

  • Memory usage above 90%
  • OOM (Out Of Memory) errors
  • Sudden service termination

Immediate response:

1. Server detail → Monitoring tab
   → Check memory usage trends
   → Identify spike or sustained increase patterns

2. Websh access
   → free -h (memory status)
   → ps aux --sort=-%mem | head -20
   → Check top 20 memory-consuming processes

3. Free temporary memory
   → Stop unnecessary services
   → Clear cache (drop_caches)
   → Verify swap activation

4. Long-term solutions
   - Memory leak → Code fix and deployment
   - Normal increase → Server upgrade
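
A minimal command sketch for steps 2 and 3 above. Note that dropping caches requires root and only releases reclaimable page cache, so it is a stopgap rather than a fix.

    free -h                              # overall memory and swap status
    ps aux --sort=-%mem | head -20       # top 20 memory consumers
    swapon --show                        # is swap configured and in use?

    # Stopgap only: flush the page cache (root required; may briefly slow I/O)
    sync && echo 1 | sudo tee /proc/sys/vm/drop_caches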

Scenario 3: Disk space shortage

Symptoms:

  • Disk usage above 90%
  • “No space left on device” errors
  • Log recording failures

Immediate response:

1. Server detail → Monitoring tab
   → Check disk usage
   → Identify per-partition usage status

2. Websh access
   → df -h (per-partition usage)
   → du -sh /* | sort -hr | head -10
   → Identify directories consuming the most space

3. Emergency space recovery
   → Delete old log files
   → Clean temporary files (/tmp, /var/tmp)
   → Remove unnecessary packages
   → Compress after backup

4. Configure log rotation
   → Check logrotate settings
   → Adjust log retention period
   → Set up automatic cleanup scripts
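
For step 4, a simple logrotate policy such as the sketch below keeps application logs from refilling the disk; the path and retention values are examples only and should be adapted to your environment.

    # /etc/logrotate.d/myapp   (example path and retention values)
    /var/log/myapp/*.log {
        daily
        rotate 14          # keep two weeks of rotated logs
        compress
        delaycompress
        missingok
        notifempty
    }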

Scenario 4: Network traffic surge

Symptoms (paid plans):

  • Network traffic 300%+ above normal
  • Bandwidth limit reached
  • Response delays occurring

Immediate response:

1. Server detail → Monitoring tab
   → Check network traffic trends
   → Compare Peak vs AVG
   → Analyze per-interface traffic

2. Identify traffic causes
   → Analyze access logs
   → Check for DDoS attacks
   → Determine whether it is a normal traffic surge or abnormal traffic

3. Response measures
   Normal traffic surge:
   → Activate CDN
   → Strengthen caching
   → Add load balancers

   Abnormal traffic (DDoS):
   → Add firewall rules
   → Block IPs
   → Activate DDoS defense services
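
A hedged sketch for step 2 and the DDoS branch of step 3: the access log path assumes an nginx-style setup, the blocked address is a placeholder from the documentation range, and your firewall tooling (iptables, nftables, a cloud firewall) may differ.

    # Top 10 client IPs in the web access log (adjust the path to your web server)
    awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10

    # Per-interface packet/byte counters (run twice to estimate the rate)
    ip -s link show

    # Block a clearly abusive IP (placeholder address)
    sudo iptables -I INPUT -s 203.0.113.10 -j DROP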

Scenario 5: Disk I/O bottleneck

Symptoms (paid plans):

  • High disk I/O wait
  • Query response slowdown
  • Application timeouts

Immediate response:

1. Server detail → Monitoring tab
   → Check disk I/O Peak/AVG
   → Analyze per-disk I/O

2. Websh access
   → iostat -x 1 10
   → Check I/O wait rate (%iowait)
   → iotop (identify high I/O processes)

3. Immediate measures
   → Pause I/O-intensive tasks
   → Reschedule backup/batch jobs
   → Minimize unnecessary file access

4. Long-term solutions
   - Index optimization (DB)
   - Upgrade to SSD
   - Add read replica (DB)
   - Introduce caching layer
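
A short command sketch for step 2 above; iostat and pidstat come from the sysstat package, and iotop usually requires root.

    # Extended device stats, 10 one-second samples; watch %util and the await columns
    iostat -x 1 10

    # Per-process I/O, batch mode, three iterations (root required)
    sudo iotop -o -b -n 3 | head -20

    # Alternative if iotop is unavailable
    pidstat -d 1 5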

Proactive prevention strategies

Baseline establishment

Record metric patterns during normal operations to quickly detect anomalies.

Baseline recording items:

Daily patterns:
- Business hours (9-18): Average CPU 30-40%
- Night time (0-6): Average CPU 10-15%
- Lunch time (12-13): Traffic decrease

Weekly patterns:
- Monday morning: Traffic increase (weekly batch)
- Wednesday evening: Service deployment
- Weekends: Low load

Monthly patterns:
- Start of month: CPU/DB load increase due to billing batch
- Disk usage: 5GB increase per month (logs)
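
If the sysstat collector is running, its daily archives make it easy to turn the patterns above into concrete numbers. This assumes sar data is being collected; archive file names under /var/log/sa vary by distribution.

    # CPU usage for a past day from the sysstat archive (file name varies by distro)
    sar -u -f /var/log/sa/sa01

    # Memory and per-interface network history for the same day
    sar -r -f /var/log/sa/sa01
    sar -n DEV -f /var/log/sa/sa01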

Capacity planning

Analyze metric trends to secure resources proactively.

Planning method:

1. Collect past 3 months of data
   → Server detail → Monitoring tab
   → Record monthly Peak and AVG

2. Calculate growth rate
   → Disk: Average 10GB increase per month
   → Memory: 5% increase per month
   → Traffic: 15% increase per month

3. Predict threshold arrival time
   → Current disk 60% → 90% in 6 months
   → Plan capacity expansion 3 months ahead

4. Secure budget and execute
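
The prediction in step 3 is simple arithmetic; the sketch below uses the illustrative figures from above (the 60% to 90% in 6 months example implies roughly 5 percentage points of growth per month).

    # Months until the disk alert threshold, using the example figures above
    current=60        # % used today
    threshold=90      # alert level (%)
    growth=5          # percentage points added per month (implied by the example)
    echo $(( (threshold - current) / growth ))   # -> 6 months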

Regular inspection

Daily inspection:

  • Check and respond to critical alerts
  • Visually verify key server metrics
  • Check for sudden changes compared to previous day

Weekly inspection:

  • Analyze metric trends for all servers
  • Review unresolved warning alerts
  • Analyze log error patterns
  • Verify disk usage growth trends

Monthly inspection:

  • Readjust alert rule thresholds
  • Update baselines
  • Review capacity planning progress
  • Analyze and improve incident response times

Recommendations

Regular reviews

  • Daily: Check critical alerts and abnormal metrics
  • Weekly: Analyze per-server metric trends
  • Monthly: Review and adjust alert rule appropriateness

Efficient alert management

  • Minimize unnecessary alerts to prevent alert fatigue
  • Clearly distinguish between critical and informational alerts
  • Differentiate notification channels based on team roles

Documentation

  • Record normal operation baselines for each server
  • Document past incident cases and response methods
  • Create response manuals for alert occurrences