Troubleshooting v1.0.0

Diagnose and resolve common issues with the SLYD platform, compute instances, and applications.

About This Guide

This troubleshooting guide addresses common issues you might encounter when using SLYD. If you can't find a solution to your problem here, please contact our support team or ask Benson for assistance.

Compute Instance Issues

Instance Fails to Start

High Impact

Your instance remains in a "Starting" or "Error" state and fails to become available after several minutes.

Possible Causes

  • Resource availability constraints in the selected region
  • Issues with the base image or snapshot
  • Insufficient quota for the instance type
  • Infrastructure maintenance or outage

Resolution Steps

  1. Check the instance status details

    Navigate to the instance details page and check the status message for specific error information.

  2. Verify your account quota

    Go to Account → Quotas to ensure you haven't exceeded your instance limit for the region or instance type.

  3. Try a different region

    If resources are constrained in your selected region, try deploying the instance in a different region.

  4. Check the system status page

    Visit status.slyd.cloud to see if there are any ongoing infrastructure issues.

  5. Restart the instance creation process

    Delete the failed instance and create a new one, potentially with a different image or size.

Diagnostic Information

Check the Events tab in your instance details for specific error codes. Common error codes include:

  • INSUFFICIENT_CAPACITY: The region doesn't have enough resources available
  • QUOTA_EXCEEDED: You've reached your account limit for this resource type
  • IMAGE_CORRUPT: The selected image has issues and can't be deployed
  • HARDWARE_FAILURE: Physical hardware issues in the data center
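
The codes above pair naturally with the resolution steps earlier in this section. As a minimal sketch, the mapping could be written as a small shell helper; suggest_action is a hypothetical name used for illustration and is not part of any SLYD tooling.

```shell
# Map an Events-tab error code to the matching resolution step from this guide.
# suggest_action is a hypothetical helper, not a SLYD command.
suggest_action() {
    case "$1" in
        INSUFFICIENT_CAPACITY) echo "Try a different region" ;;
        QUOTA_EXCEEDED)        echo "Check Account > Quotas" ;;
        IMAGE_CORRUPT)         echo "Recreate the instance with a different image" ;;
        HARDWARE_FAILURE)      echo "Check status.slyd.cloud, then contact support" ;;
        *)                     echo "Unrecognized code: contact support"; return 1 ;;
    esac
}
```

Calling suggest_action QUOTA_EXCEEDED prints the quota step; any unlisted code returns a non-zero status so a script can fall through to the support path.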

When to Contact Support

If you've tried all the steps above and your instance still fails to start, contact support with the following information:

  • Instance ID
  • Region
  • Instance type
  • Time of failure
  • Error codes from the Events tab

Instance Suddenly Stopped or Restarted

High Impact

A running instance stopped, restarted, or keeps rebooting unexpectedly.

Possible Causes

  • Resource exhaustion (CPU, memory, disk)
  • Host system maintenance or issues
  • Operating system crashes or kernel panics
  • Security systems terminating problematic processes
  • Payment or billing issues leading to service suspension

Resolution Steps

  1. Check instance metrics

    Review resource usage metrics to identify potential resource exhaustion issues.

  2. Review system logs

    Connect to the instance and check system logs for error messages, OOM (Out of Memory) events, or crash reports.

    # Check system logs
    sudo journalctl -xb -p err
    
    # Check for out of memory events
    sudo grep -i "out of memory" /var/log/syslog
    
    # Check kernel logs
    sudo dmesg | grep -i error
                                    
  3. Verify billing status

    Check your account billing page to ensure no payment issues or account suspensions.

  4. Check resource usage and limits

    Ensure your application isn't exceeding the instance's resource limits.

    # Check current resource usage
    htop
    
    # Check disk space
    df -h
    
    # Check for processes consuming excessive resources
    ps aux --sort=-%mem | head -10
                                    
  5. Enable crash reporting

    Configure the operating system to preserve crash dumps for analysis.
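
On Linux, kernel crash dumps are usually captured by kdump, which only works if memory was reserved at boot through the crashkernel= kernel parameter. The sketch below checks for that parameter; has_crashkernel is a hypothetical helper name, and how kdump itself is installed varies by distribution.

```shell
# Return 0 if the kernel command line reserves memory for a crash kernel.
# Reads /proc/cmdline by default; pass another file to inspect a saved copy.
has_crashkernel() {
    grep -Eq '(^|[[:space:]])crashkernel=' "${1:-/proc/cmdline}"
}
```

If the check fails, crash dumps will not be written to /var/crash/ even when the kdump service is installed.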

Diagnostic Information

The instance Events tab will show restart events and any automated actions taken by the platform. Key files to examine include:

  • /var/log/syslog: General system logs
  • /var/log/kern.log: Kernel-related messages
  • /var/crash/: System crash dumps (if enabled)
  • ~/.pm2/logs/: PM2 process manager logs (if using Node.js)
  • /var/log/nginx/: Nginx web server logs

Preventative Measures

  • Implement proper resource monitoring with alerts for high usage
  • Configure auto-scaling for applications with variable loads
  • Use load testing to identify resource bottlenecks before production
  • Implement proper error handling in applications
  • Keep operating systems and applications updated
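
As one possible starting point for the monitoring item above, the following sketch flags filesystems that are filling up. It assumes the POSIX output format of df -P, where the fifth column is the capacity percentage; disk_alerts is a hypothetical helper name, and mount points containing spaces are not handled.

```shell
# Print "mountpoint capacity" for filesystems at or above a usage threshold.
# Expects df -P output on stdin; the first argument is the threshold in percent.
disk_alerts() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)                 # strip the trailing percent sign
        if (use + 0 >= limit + 0) print $6 " " $5
    }'
}
```

Running df -P and piping it into disk_alerts 80 lists every filesystem that is at least 80% full; the output can be fed to whatever alerting channel you already use.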

Instance Running Slowly

Medium Impact

Your instance is operational but experiencing poor performance, high latency, or slow response times.

Possible Causes

  • Resource contention (CPU, memory, disk I/O)
  • Network congestion or limitations
  • Inefficient application code or configuration
  • Database performance issues
  • Background processes consuming resources
  • Instance size inadequate for workload

Resolution Steps

  1. Identify resource bottlenecks

    Use monitoring tools to identify which resources are constrained.

    # Check CPU, memory, and process information
    htop
    
    # Check disk I/O performance
    iostat -x 1
    
    # Check network performance
    iftop
                                    
  2. Optimize application configuration

    Adjust application settings to better utilize available resources.

  3. Check for resource-intensive processes

    Identify and optimize or terminate unnecessary processes.

  4. Optimize database operations

    Review and optimize database queries, indexes, and connection pooling.

  5. Resize the instance

    If the workload consistently exceeds current resources, upgrade to a larger instance size.
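
One rough way to judge whether a resize is warranted is to compare the load average with the number of CPU cores. The sketch below does that arithmetic; load_per_core is a hypothetical helper name, and because Linux load averages also count tasks blocked on I/O, the result is a hint rather than a precise CPU measurement.

```shell
# Divide a load average by the number of CPU cores and print the ratio.
# usage: load_per_core LOAD CORES
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores <= 0) exit 1            # avoid dividing by zero
        printf "%.2f\n", load / cores
    }'
}
```

On a Linux instance the inputs can come from the first field of /proc/loadavg and from nproc. A ratio that stays above 1.00 suggests more runnable or I/O-blocked tasks than cores.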

Performance Analysis Tools

SLYD provides several built-in tools to help diagnose performance issues:

  • Resource Monitoring Dashboard: Real-time metrics for CPU, memory, disk, and network
  • Performance Insights: Historical performance data and trend analysis
  • Benson Performance Advisor: AI-powered recommendations for performance optimization

Additionally, you can install these common performance analysis tools:

# Install common performance tools
sudo apt update
sudo apt install htop iotop iftop sysstat netdata
                        
Benson Recommends

"Use the Performance Score feature in your instance dashboard to get an overall health rating and specific recommendations. Scores below 70 indicate potential issues that should be addressed."

"Consider enabling auto-scaling if your workload has variable demand patterns. This allows your resources to adjust automatically based on actual usage."

Networking Issues

Cannot Connect to Instance

High Impact

Unable to connect to an instance via SSH, web interface, or application endpoints.

Possible Causes

  • Instance firewall rules blocking connections
  • Cloudflare tunnel configuration issues
  • SSH key or authentication problems
  • Instance network interface issues
  • Application or service not running on expected port
  • DNS propagation delays

Resolution Steps

  1. Verify instance status

    Ensure the instance is in "Running" state and has passed all health checks.

  2. Check tunnel status

    Navigate to Networking → Tunnels to verify the tunnel is active and properly configured.

  3. Verify firewall rules

    Check that your instance's firewall allows traffic on the required ports.

  4. Test connection using SLYD Console

    Use the web-based terminal in the SLYD Console to access your instance directly, bypassing external networking.

  5. Check service status

    If you can access the instance but not a specific service, verify the service is running:

    # Check listening ports (ss replaces netstat, which newer distributions may not install by default)
    sudo ss -tulpn | grep LISTEN
    
    # Check SSH service
    sudo systemctl status sshd
    
    # Check web server
    sudo systemctl status nginx  # or apache2, etc.
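
To check one specific port rather than reading the whole listing, a small filter can help. This sketch assumes the column layout of ss -tln, where the first column is the socket state and the fourth is the local address and port; port_listening is a hypothetical helper name.

```shell
# Return 0 if a TCP listener on the given port appears in ss -tln output (stdin).
# The 4th column is the local address, e.g. 0.0.0.0:22 or [::]:22.
port_listening() {
    awk -v port="$1" '
        $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}
```

For example, piping ss -tln into port_listening 22 succeeds only when something is listening on port 22, which separates "service not running" from firewall or tunnel problems.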
                                    

Diagnostic Information

If you cannot connect via SSH, try these diagnostic steps:

  1. Enable verbose SSH logging
    # Add -v (up to -vvv for maximum verbosity); replace <user> with your login name
    ssh -v <user>@instance-xyz.slyd.dev
                                    
  2. Check DNS resolution
    # Verify DNS resolution
    nslookup instance-xyz.slyd.dev
    
    # Check that port 22 is reachable, with a 5-second timeout
    nc -zv -w 5 instance-xyz.slyd.dev 22
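
The error text printed by the SSH client usually points at one of the possible causes listed earlier. The sketch below maps common OpenSSH client messages to those causes; classify_ssh_error is a hypothetical helper name, and the interpretations are typical rather than guaranteed.

```shell
# Map an OpenSSH client error message to a likely cause from this guide.
# classify_ssh_error is a hypothetical helper; pass the error line as $1.
classify_ssh_error() {
    case "$1" in
        *"Could not resolve hostname"*) echo "DNS: name does not resolve yet" ;;
        *"Connection refused"*)         echo "Service: nothing is listening on that port" ;;
        *"Connection timed out"*)       echo "Network: firewall or tunnel is dropping traffic" ;;
        *"Permission denied"*)          echo "Auth: SSH key or user is not accepted" ;;
        *)                              echo "Unknown: rerun with ssh -vvv for more detail" ;;
    esac
}
```

Pass it the last line of the failed ssh attempt to get a first guess at which resolution step to try.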
                                    

Connection Troubleshooting Flow

[Figure: Connection Troubleshooting Flowchart]

Slow Network Performance

Medium Impact

Instance is experiencing high latency, slow data transfer rates, or intermittent connectivity.

Possible Causes

  • Geographic distance between client and instance region
  • Network congestion in data center or Cloudflare network
  • Bandwidth throttling or limitations
  • Large data transfers saturating available bandwidth
  • DNS or CDN configuration issues
  • DDoS protection temporarily limiting connections

Resolution Steps

  1. Run network diagnostics

    Test network speed and latency from your instance.

    # Install network tools
    sudo apt install speedtest-cli mtr
    
    # Run speed test
    speedtest-cli
    
    # Run traceroute with timing
    mtr -rw google.com
                                    
  2. Check for network-intensive processes

    Identify processes that might be consuming excessive bandwidth.

    # Monitor network usage by process
    sudo apt install nethogs
    sudo nethogs
                                    
  3. Optimize content delivery

    For web applications, enable compression and caching.

  4. Consider a different region

    If latency is consistently high, consider deploying to a region closer to your users.
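
To confirm that compression is actually enabled for a web application (step 3), you can inspect the response headers. The sketch below reads headers from standard input and looks for a Content-Encoding header; has_compression is a hypothetical helper name.

```shell
# Return 0 if HTTP response headers on stdin advertise a compressed body.
has_compression() {
    # Strip carriage returns, then look for a known Content-Encoding value.
    tr -d '\r' | grep -Eiq '^content-encoding:[[:space:]]*(gzip|br|deflate|zstd)'
}
```

One way to feed it is to request only the headers with curl -sI, adding the request header Accept-Encoding: gzip, and pipe the result into has_compression.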

Network Performance Benchmarking

Compare your network performance against SLYD benchmarks:

Metric                    Expected Range    Poor Performance
Download Speed            500-2000 Mbps     < 200 Mbps
Upload Speed              400-1500 Mbps     < 150 Mbps
Latency (Same Region)     1-5 ms            > 10 ms
Latency (Cross-Region)    20-100 ms         > 150 ms
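
A measured value can be compared against the table with a small helper. This is a sketch only: rate_metric is a hypothetical name, and the label "borderline" for values that fall between the two published ranges is an assumption, since the table does not define that gap.

```shell
# Classify a measurement against an expected limit and a poor limit.
# usage: rate_metric VALUE EXPECTED_LIMIT POOR_LIMIT higher|lower
#   higher: larger is better (throughput); lower: smaller is better (latency)
rate_metric() {
    awk -v v="$1" -v ok="$2" -v bad="$3" -v dir="$4" 'BEGIN {
        if (dir == "higher") {
            if (v >= ok) r = "expected"; else if (v < bad) r = "poor"; else r = "borderline"
        } else {
            if (v <= ok) r = "expected"; else if (v > bad) r = "poor"; else r = "borderline"
        }
        print r
    }'
}
```

For example, rate_metric 120 500 200 higher rates a 120 Mbps download, and rate_metric 7 5 10 lower rates a 7 ms same-region latency.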

Network Optimization Tips

  • Enable HTTP/2 for web applications to improve connection efficiency
  • Use a content delivery network (CDN) for static assets
  • Implement proper caching headers for web content
  • Consider enabling Cloudflare's Argo Smart Routing for improved performance
  • Schedule large data transfers during off-peak hours
  • Compress data before transmission for large file transfers

Application Issues

Application Deployment Failure

Medium Impact

An application fails to deploy, either from the marketplace or as a custom deployment.

Possible Causes

  • Insufficient resources for the application
  • Dependency conflicts or missing prerequisites
  • Network issues during package download
  • Incompatible application versions
  • Incorrect configuration parameters

Resolution Steps

  1. Review deployment logs

    Check the deployment logs for specific error messages.

  2. Verify resource requirements

    Ensure your instance meets the minimum requirements for the application.

  3. Check for dependency conflicts

    If deploying multiple applications, ensure they don't have conflicting dependencies.

  4. Try manual installation

    If marketplace deployment fails, try installing the application manually to identify specific issues.

  5. Check application compatibility

    Verify the application is compatible with your instance's operating system and environment.
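
For step 2, the memory check can be scripted. The sketch below compares the MemTotal value from /proc/meminfo (reported in kB) with a required amount in MB; meets_memory is a hypothetical helper name, and the required figure has to come from the application's own documentation.

```shell
# Return 0 if MemTotal in a meminfo-format file is at least REQUIRED_MB.
# usage: meets_memory REQUIRED_MB [FILE]   (FILE defaults to /proc/meminfo)
meets_memory() {
    awk -v need="$1" '
        $1 == "MemTotal:" { ok = ($2 / 1024 >= need) }   # kB to MB
        END { exit !ok }
    ' "${2:-/proc/meminfo}"
}
```

Note that MemTotal is typically a little lower than the nominal instance size because the kernel reserves some memory, so leave a margin when choosing the threshold.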

Common Deployment Errors

ERROR_INSUFFICIENT_MEMORY

The instance does not have enough memory to run the application.

Solution: Resize your instance to a larger memory configuration or optimize the application's memory usage.

ERROR_DEPENDENCY_CONFLICT

The application has dependency conflicts with existing software.

Solution: Use container isolation or virtual environments to separate applications with conflicting dependencies.

ERROR_NETWORK_TIMEOUT

Network timeout occurred during package download.

Solution: Check network connectivity, try again later, or use a mirror repository if available.
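
"Try again later" can be automated with a retry loop that waits longer after each failure. The following is a minimal sketch; retry is a hypothetical helper name, and the attempt count and delays are placeholders to tune for your environment.

```shell
# Retry a command up to MAX times, doubling the wait after each failure.
# usage: retry MAX INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry() {
    retry_max=$1
    retry_delay=$2
    shift 2
    retry_attempt=1
    while ! "$@"; do
        # Give up once the maximum number of attempts has been used.
        if [ "$retry_attempt" -ge "$retry_max" ]; then
            return 1
        fi
        sleep "$retry_delay"
        retry_delay=$((retry_delay * 2))
        retry_attempt=$((retry_attempt + 1))
    done
}
```

For example, retry 5 2 sudo apt-get update runs the update up to five times, waiting 2, 4, 8, and 16 seconds between attempts.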

Application Crashes or Exits Unexpectedly

High Impact

Application starts but crashes or exits unexpectedly during operation.

Possible Causes

  • Application bugs or code errors
  • Resource exhaustion (memory leaks, CPU spikes)
  • Missing or corrupted configuration files
  • Unexpected input or data corruption
  • External service dependencies unavailable
  • System signals terminating the process
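
When a process is terminated by a signal, the shell reports an exit status of 128 plus the signal number, which helps separate the last cause from the others. The sketch below decodes that status; exit_signal is a hypothetical helper name.

```shell
# Print the signal name for an exit status above 128 (status = 128 + signal number).
# Prints "none" for ordinary exit codes.
exit_signal() {
    if [ "$1" -gt 128 ]; then
        kill -l "$(( $1 - 128 ))"
    else
        echo "none"
    fi
}
```

A status of 137 decodes to KILL, which is often the out-of-memory killer at work; 139 decodes to SEGV, which usually points at a bug in the application or one of its native libraries.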

Resolution Steps

  1. Check application logs

    Review application logs for error messages or exceptions.

    # Common log locations
    # Web server logs
    sudo tail -f /var/log/nginx/error.log
    
    # Application-specific logs
    sudo tail -f /var/log/your-application/error.log
    
    # System journal for service errors
    sudo journalctl -u your-application-service -f
                                    
  2. Monitor resource usage

    Check if the application is hitting resource limits before crashing.

  3. Implement process monitoring

    Use a process manager to automatically restart crashed applications.

    # For Node.js applications
    npm install -g pm2
    pm2 start app.js --name "myapp" --watch
    
    # For Python applications
    pip install supervisor
    supervisord -c /etc/supervisor/supervisord.conf
                                    
  4. Check external dependencies

    Verify all external services the application depends on are available and responding.

  5. Update the application

    Check for application updates that might address known stability issues.
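
For step 2, one lightweight way to see whether memory grows before a crash is to sample the process's resident set size from /proc. This is a sketch for Linux only; rss_kb and watch_rss are hypothetical helper names.

```shell
# Print the resident memory (VmRSS, in kB) from a /proc/<pid>/status-format file.
rss_kb() {
    awk '$1 == "VmRSS:" { print $2 }' "$1"
}

# Log a timestamped RSS sample every INTERVAL seconds until the process exits.
# usage: watch_rss PID INTERVAL_SECONDS
watch_rss() {
    while [ -r "/proc/$1/status" ]; do
        printf '%s %s kB\n' "$(date +%H:%M:%S)" "$(rss_kb "/proc/$1/status")"
        sleep "$2"
    done
}
```

A value that climbs steadily until the process disappears suggests a memory leak; a flat line before the crash points toward one of the other causes.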

Advanced Debugging Techniques

  • Core dump analysis

    Enable and analyze core dumps to identify crash causes.

    # Enable core dumps
    ulimit -c unlimited
    echo '/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
    
    # Analyze core dump (example for C/C++ applications)
    gdb /path/to/application /tmp/core.applicationname.1234
                                    
  • Application profiling

    Use profiling tools to identify performance bottlenecks or memory leaks.

  • Process tracing

    Trace system calls and signals to understand application behavior.

    # Trace system calls
    strace -f -p [PROCESS_ID]
    
    # Trace only specific calls (current C libraries open files through openat)
    strace -e trace=openat,read,write -p [PROCESS_ID]
                                    

Billing & Account Issues

Unexpected or High Charges

Medium Impact

Your bill is higher than expected or contains charges you don't recognize.

Possible Causes

  • Instances left running when no longer needed
  • Resource-intensive applications consuming more than expected
  • Unoptimized storage usage or unnecessary snapshots
  • High network data transfer costs
  • Additional services or add-ons enabled

Resolution Steps

  1. Review billing details

    Examine your billing statement for a breakdown of charges by resource type and instance.

  2. Check active resources

    Inventory all running instances, stored snapshots, and persistent volumes.

  3. Analyze usage patterns

    Review resource utilization graphs to identify unexpected spikes or continuous high usage.

  4. Implement cost controls

    Set up budget alerts and usage notifications to prevent future surprises.

  5. Optimize resource usage

    Rightsize instances, delete unnecessary snapshots, and optimize storage.

Cost Optimization Tips

  • Shut down development instances when not in use
  • Use resource scheduling to start and stop instances automatically
  • Implement snapshot lifecycle policies to automatically delete old snapshots
  • Monitor network transfer costs and optimize data flow
  • Right-size your instances based on actual resource utilization
  • Use Benson's cost optimization recommendations for instance-specific advice
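
To get a feel for what a shutdown schedule is worth, you can compute how much of a 168-hour week an instance would be stopped. The sketch below does only that arithmetic; off_hours_pct is a hypothetical helper name, and the result equals a cost saving only under the assumption that a stopped instance incurs no charges, which may not hold for storage or other attached resources.

```shell
# Percentage of a 168-hour week that an instance is stopped under a schedule.
# usage: off_hours_pct HOURS_ON_PER_DAY DAYS_ON_PER_WEEK
off_hours_pct() {
    awk -v h="$1" -v d="$2" 'BEGIN {
        printf "%.0f\n", (168 - h * d) * 100 / 168
    }'
}
```

An instance that runs 12 hours a day on 5 days a week is on for 60 of 168 hours, so off_hours_pct 12 5 prints 64.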

Benson Recommends

"I've analyzed your usage patterns and noticed that you could save approximately 35% by using a scheduled shutdown policy for your development instances during non-working hours (nights and weekends)."

"Consider using resource tagging to track costs by project or department, making it easier to identify where optimization opportunities exist."
