Fugoku Docs

GPU Compute

NVIDIA H100 and A100 GPUs for AI/ML training, inference, and high-performance computing

Fugoku offers bare-metal NVIDIA H100 and A100 GPUs for AI/ML training, inference, rendering, and compute-intensive workloads.

GPU Plans

All GPU instances include:

  • Bare-metal GPU access (no virtualization overhead)
  • CUDA 12.x pre-installed
  • cuDNN libraries
  • NVIDIA drivers (latest stable)
  • High-bandwidth CPU-GPU interconnect
  • NVMe SSD storage
  • 10 Gbps network

NVIDIA A100 Plans

Professional GPU for training and inference.

| Plan | GPU | VRAM | vCPU | RAM | Storage | Price/hr | Price/mo* |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpu-a100-1 | 1x A100 40GB | 40 GB | 12 | 64 GB | 500 GB | $2.50 | $1,800 |
| gpu-a100-2 | 2x A100 40GB | 80 GB | 24 | 128 GB | 1 TB | $5.00 | $3,600 |
| gpu-a100-4 | 4x A100 40GB | 160 GB | 48 | 256 GB | 2 TB | $10.00 | $7,200 |
| gpu-a100-8 | 8x A100 40GB | 320 GB | 96 | 512 GB | 4 TB | $20.00 | $14,400 |

*Based on 24/7 usage (720 hrs/month)
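
The monthly figure is just the hourly rate times 720 hours (24/7 for a 30-day month). A one-liner to sanity-check a plan's cost:

```python
def monthly_cost(hourly_rate: float, hours_per_month: int = 720) -> float:
    """Monthly cost at continuous (24/7) usage."""
    return hourly_rate * hours_per_month

# gpu-a100-1 at $2.50/hr:
print(monthly_cost(2.50))  # 1800.0
```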

Specs:

  • Architecture: NVIDIA Ampere
  • FP32: 19.5 TFLOPS
  • FP16 (Tensor Cores): 312 TFLOPS
  • Memory Bandwidth: 1.6 TB/s
  • NVLink: 600 GB/s (multi-GPU)

Use cases:

  • Large language model training (up to 30B parameters)
  • Computer vision model training
  • Reinforcement learning
  • Batch inference at scale
  • Scientific computing

NVIDIA H100 Plans

Flagship GPU for cutting-edge AI research.

| Plan | GPU | VRAM | vCPU | RAM | Storage | Price/hr | Price/mo* |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpu-h100-1 | 1x H100 80GB | 80 GB | 16 | 128 GB | 1 TB | $4.00 | $2,880 |
| gpu-h100-2 | 2x H100 80GB | 160 GB | 32 | 256 GB | 2 TB | $8.00 | $5,760 |
| gpu-h100-4 | 4x H100 80GB | 320 GB | 64 | 512 GB | 4 TB | $16.00 | $11,520 |
| gpu-h100-8 | 8x H100 80GB | 640 GB | 128 | 1 TB | 8 TB | $32.00 | $23,040 |

Specs:

  • Architecture: NVIDIA Hopper
  • FP32: 67 TFLOPS
  • FP16 (Tensor Cores): 1,979 TFLOPS
  • FP8 (Transformer Engine): 3,958 TFLOPS
  • Memory Bandwidth: 3.35 TB/s
  • NVLink 4.0: 900 GB/s (multi-GPU)

Use cases:

  • Large language models (100B+ parameters)
  • Diffusion models (Stable Diffusion, DALL-E)
  • Multi-modal models (CLIP, Flamingo)
  • Large-scale training with DeepSpeed/Megatron
  • High-throughput inference

Pricing Models

On-Demand (default):

  • Billed per second, priced by the hour
  • No commitment
  • Full flexibility

Reserved (coming Q3 2026):

  • 1-year or 3-year commitment
  • 30-50% discount
  • Guaranteed availability

Spot (coming Q4 2026):

  • Spare capacity, up to 70% discount
  • Can be interrupted with 30-second warning
  • Great for fault-tolerant training jobs

Creating a GPU Instance

Via Console

  1. Navigate to AI & ML → GPU Instances
  2. Click Create GPU Instance
  3. Select:
    • GPU Type: A100 or H100
    • Configuration: Number of GPUs
    • Region: lagos-1, london-1, frankfurt-1
    • Image: PyTorch, TensorFlow, or blank Ubuntu
    • SSH Keys: For access
  4. Click Create

Provisioning time: 2-5 minutes (longer than CPU instances).

Via CLI

# Single A100 with PyTorch
fugoku create instance \
  --name train-job-1 \
  --plan gpu-a100-1 \
  --image pytorch-2.0-cuda12 \
  --region lagos-1 \
  --ssh-key laptop

# 8x H100 for large-scale training
fugoku create instance \
  --name llm-training \
  --plan gpu-h100-8 \
  --image ubuntu-22.04-cuda12 \
  --region frankfurt-1 \
  --ssh-key laptop

Via API

curl -X POST https://api.fugoku.com/v1/instances \
  -H "Authorization: Bearer $FUGOKU_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "train-job-1",
    "plan": "gpu-a100-1",
    "image": "pytorch-2.0-cuda12",
    "region": "lagos-1",
    "ssh_keys": ["laptop"]
  }'
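
The same call from Python, using only the standard library (mirrors the curl request above; uncomment the last lines to actually send it):

```python
import json
import os
import urllib.request

payload = {
    "name": "train-job-1",
    "plan": "gpu-a100-1",
    "image": "pytorch-2.0-cuda12",
    "region": "lagos-1",
    "ssh_keys": ["laptop"],
}

req = urllib.request.Request(
    "https://api.fugoku.com/v1/instances",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {os.environ.get('FUGOKU_API_TOKEN', '')}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```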

Pre-Configured Images

PyTorch

  • Image: pytorch-2.0-cuda12
  • Includes:
    • PyTorch 2.0.1 with CUDA 12.1
    • torchvision, torchaudio
    • Transformers (Hugging Face)
    • NVIDIA Apex
    • Jupyter Lab
    • Common ML libraries (numpy, pandas, scikit-learn)
# Launch and verify
fugoku ssh train-job-1
python3 -c "import torch; print(torch.cuda.is_available())"
# True
python3 -c "import torch; print(torch.cuda.get_device_name(0))"
# NVIDIA A100-SXM4-40GB

TensorFlow

  • Image: tensorflow-2.13-cuda12
  • Includes:
    • TensorFlow 2.13 with CUDA 12.1
    • Keras
    • TensorBoard
    • Jupyter Lab
    • Common ML libraries
# Verify
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

JAX

  • Image: jax-0.4-cuda12
  • Includes:
    • JAX 0.4.13
    • Flax
    • Optax
    • Jupyter Lab
# Verify
python3 -c "import jax; print(jax.devices())"
# [GpuDevice(id=0)]

Blank CUDA

  • Image: ubuntu-22.04-cuda12
  • Includes:
    • Ubuntu 22.04
    • CUDA 12.1
    • cuDNN 8.9
    • NVIDIA drivers
    • Build tools (gcc, make, cmake)

Perfect for custom environments or frameworks not pre-packaged.

Example Workflows

Training a Model with PyTorch

# Create GPU instance
fugoku create instance \
  --name pytorch-train \
  --plan gpu-a100-1 \
  --image pytorch-2.0-cuda12 \
  --region lagos-1

# SSH into instance
fugoku ssh pytorch-train

# Clone your training repo
git clone https://github.com/yourname/model-training.git
cd model-training

# Install additional dependencies
pip install -r requirements.txt

# Run training
python train.py \
  --batch-size 64 \
  --epochs 100 \
  --lr 0.001 \
  --gpu 0

# Monitor GPU usage
nvidia-smi -l 1

Distributed Training (Multi-GPU)

Using PyTorch DistributedDataParallel:

# train.py
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def train(rank, world_size):
    setup(rank, world_size)
    
    model = YourModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    
    # Training loop (dataloader should use a DistributedSampler)
    for epoch in range(num_epochs):
        for inputs, targets in dataloader:
            inputs, targets = inputs.to(rank), targets.to(rank)
            optimizer.zero_grad()
            outputs = ddp_model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
    
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4  # 4 GPUs
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)

Run on a 4-GPU instance:

python train.py

Fine-Tuning with Hugging Face

# SSH into instance
fugoku ssh pytorch-train

# Install transformers
pip install transformers datasets accelerate

# Fine-tune BERT
torchrun \
  --nproc_per_node=1 \
  run_glue.py \
  --model_name_or_path bert-base-uncased \
  --task_name mnli \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir /tmp/mnli_output

Inference Server

Deploy trained model for inference:

# inference.py
import torch
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load model
model = torch.load('model.pth')
model.eval()
model = model.cuda()

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json['data']
    tensor = torch.tensor(data).cuda()
    
    with torch.no_grad():
        output = model(tensor)
    
    return jsonify({'prediction': output.cpu().numpy().tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Run server:

python inference.py

Test from client:

curl -X POST http://gpu-instance-ip:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"data": [1.0, 2.0, 3.0]}'
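
Or the same request from Python with the standard library (`gpu-instance-ip` is a placeholder for your instance's address):

```python
import json
import urllib.request

def build_predict_request(host: str, data: list) -> urllib.request.Request:
    """Build the POST /predict request that the curl example sends."""
    return urllib.request.Request(
        f"http://{host}:5000/predict",
        data=json.dumps({"data": data}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# with urllib.request.urlopen(build_predict_request("gpu-instance-ip", [1.0, 2.0, 3.0])) as resp:
#     print(json.load(resp)["prediction"])
```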

Monitoring GPU Usage

nvidia-smi

Real-time GPU monitoring:

# One-time snapshot
nvidia-smi

# Continuous monitoring (1 second refresh)
nvidia-smi -l 1

# Compact format
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv -l 1

Output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB   On   | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0   120W / 400W |  32768MiB / 40960MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

Fugoku Console Metrics

Navigate to instance detail page → GPU Metrics tab:

  • GPU utilization (%)
  • Memory utilization (%)
  • Temperature (°C)
  • Power usage (W)
  • Graphs: 1h, 6h, 24h, 7d

Programmatic Monitoring

import subprocess
import json

def get_gpu_stats():
    result = subprocess.run([
        'nvidia-smi',
        '--query-gpu=index,utilization.gpu,memory.used,memory.total,temperature.gpu',
        '--format=csv,noheader,nounits'
    ], capture_output=True, text=True)
    
    lines = result.stdout.strip().split('\n')
    stats = []
    for line in lines:
        idx, util, mem_used, mem_total, temp = line.split(', ')
        stats.append({
            'gpu': int(idx),
            'utilization': float(util),
            'memory_used': int(mem_used),
            'memory_total': int(mem_total),
            'temperature': float(temp)
        })
    return stats

# Usage
print(json.dumps(get_gpu_stats(), indent=2))

Storage for ML Workflows

Root Disk

GPU instances include large root disks (500 GB to 8 TB).

For datasets larger than the root disk, attach block volumes:

# Create 2 TB volume
fugoku volumes create \
  --name datasets \
  --size 2000 \
  --region lagos-1

# Attach to GPU instance
fugoku volumes attach datasets --instance pytorch-train

# On the instance, format and mount:
sudo mkfs.ext4 /dev/vdb
sudo mkdir /data
sudo mount /dev/vdb /data
echo '/dev/vdb /data ext4 defaults 0 2' | sudo tee -a /etc/fstab

Object Storage (coming Q3 2026)

S3-compatible object storage for datasets and checkpoints:

import boto3

s3 = boto3.client('s3',
    endpoint_url='https://s3.fugoku.com',
    aws_access_key_id='your_key',
    aws_secret_access_key='your_secret'
)

# Upload checkpoints during training
s3.upload_file('checkpoint_epoch_10.pth', 'my-bucket', 'checkpoints/checkpoint_10.pth')

# Download datasets at training start
s3.download_file('my-bucket', 'datasets/imagenet.tar', 'imagenet.tar')

Jupyter Lab

PyTorch and TensorFlow images include Jupyter Lab pre-installed.

Start Jupyter:

# SSH into instance
fugoku ssh pytorch-train

# Start Jupyter (listens on port 8888 on all interfaces)
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root

# Output:
# [I 2024-02-25 10:30:00.123 ServerApp] Jupyter Server is running at:
# [I 2024-02-25 10:30:00.123 ServerApp] http://0.0.0.0:8888/lab?token=abc123...

Access from your laptop:

# SSH tunnel
ssh -L 8888:localhost:8888 ubuntu@gpu-instance-ip

# Open in browser
open http://localhost:8888/lab?token=abc123...

Or set a password:

jupyter lab password
# Enter password twice

jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root

Best Practices

Cost Optimization

Stop instances when not training: GPU instances are expensive - stop when idle.

# Stop instance (keeps all data)
fugoku instances stop pytorch-train

# Resume later
fugoku instances start pytorch-train

Use spot instances (when available): Up to 70% cheaper for fault-tolerant jobs.

Checkpoint frequently: Save model checkpoints every N epochs so you can resume if a spot instance is interrupted.

# Save checkpoint every 10 epochs
if epoch % 10 == 0:
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, f'checkpoint_epoch_{epoch}.pth')
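
To resume after an interruption, load the checkpoint back into the same objects. A minimal runnable sketch with a stand-in model; substitute your own model and optimizer:

```python
import torch
import torch.nn as nn

# Stand-ins so the sketch runs end to end.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

# Saved as in the snippet above ...
torch.save({
    "epoch": 90,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}, "checkpoint_epoch_90.pth")

# ... then restored on the new instance:
checkpoint = torch.load("checkpoint_epoch_90.pth")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1  # training continues from here
```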

Performance Optimization

Use mixed precision training: FP16 reduces memory usage and increases throughput.

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    
    with autocast():
        outputs = model(batch)
        loss = criterion(outputs, targets)
    
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

Pin memory in DataLoader:

dataloader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,
    pin_memory=True  # Faster CPU->GPU transfer
)

Use gradient accumulation: Simulate larger batch sizes without OOM:

accumulation_steps = 4

for i, batch in enumerate(dataloader):
    outputs = model(batch)
    loss = criterion(outputs, targets)
    loss = loss / accumulation_steps
    loss.backward()
    
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Multi-GPU Training

Data parallelism: Replicate model across GPUs, split data.

model = nn.DataParallel(model)

Distributed data parallelism (recommended): More efficient, use DistributedDataParallel.

Model parallelism: Split large models across GPUs (for models that don't fit in single GPU memory).

See PyTorch docs for details.
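
As an illustration of the idea, a naive two-way split (a sketch, not PyTorch's pipeline API; devices are parameters so the same class also runs on CPU):

```python
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    """Naive model parallelism: layers live on different devices and
    activations are moved between them in forward()."""
    def __init__(self, dev0="cuda:0", dev1="cuda:1"):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.part1 = nn.Linear(1024, 4096).to(dev0)
        self.part2 = nn.Linear(4096, 10).to(dev1)

    def forward(self, x):
        x = torch.relu(self.part1(x.to(self.dev0)))
        return self.part2(x.to(self.dev1))  # hop to the second device
```

Each GPU now holds only its half of the parameters, at the cost of idle time while activations cross devices; pipeline parallelism mitigates that.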

Troubleshooting

CUDA out of memory:

  • Reduce batch size
  • Enable gradient checkpointing
  • Use mixed precision
  • Clear cache: torch.cuda.empty_cache()
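
Gradient checkpointing trades compute for memory: activations inside checkpointed segments are recomputed during the backward pass instead of being stored. A sketch using `torch.utils.checkpoint` on a toy model:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)

x = torch.randn(32, 512, requires_grad=True)
# Split into 2 segments; only segment boundaries keep activations.
out = checkpoint_sequential(model, 2, x)
out.sum().backward()
```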

Slow training:

  • Check GPU utilization with nvidia-smi - it should be 90-100% during training
  • Profile with the PyTorch profiler
  • Check for a data-loading bottleneck - increase num_workers in the DataLoader
  • Verify disk I/O isn't the limiting factor
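
A quick way to see where time goes with the PyTorch profiler (sketch on a toy model; the CUDA activity is only added when a GPU is present):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024)
x = torch.randn(64, 1024)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    model, x = model.cuda(), x.cuda()

with profile(activities=activities) as prof:
    for _ in range(10):
        model(x)

# Top ops by time; look for copies or data loading dominating GPU compute.
sort_key = "cuda_time_total" if torch.cuda.is_available() else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))
```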

Driver issues: Drivers come pre-installed and tested. If you still hit problems:

# Check driver
nvidia-smi

# Reinstall if needed (contact support first)
sudo apt install --reinstall nvidia-driver-525

Getting Help

GPU instance issues? We're here to help.

Common support requests:

  • Driver/CUDA version conflicts
  • Multi-GPU configuration
  • Performance optimization
  • Custom image requests
