GPU Compute
NVIDIA H100 and A100 GPUs for AI/ML training, inference, and high-performance computing
Fugoku offers bare-metal NVIDIA H100 and A100 GPUs for AI/ML training, inference, rendering, and compute-intensive workloads.
GPU Plans
All GPU instances include:
- Bare-metal GPU access (no virtualization overhead)
- CUDA 12.x pre-installed
- cuDNN libraries
- NVIDIA drivers (latest stable)
- High-bandwidth CPU-GPU interconnect
- NVMe SSD storage
- 10 Gbps network
NVIDIA A100 Plans
Professional GPU for training and inference.
| Plan | GPU | VRAM | vCPU | RAM | Storage | Price/hr | Price/mo* |
|---|---|---|---|---|---|---|---|
| gpu-a100-1 | 1x A100 40GB | 40 GB | 12 | 64 GB | 500 GB | $2.50 | $1,800 |
| gpu-a100-2 | 2x A100 40GB | 80 GB | 24 | 128 GB | 1 TB | $5.00 | $3,600 |
| gpu-a100-4 | 4x A100 40GB | 160 GB | 48 | 256 GB | 2 TB | $10.00 | $7,200 |
| gpu-a100-8 | 8x A100 40GB | 320 GB | 96 | 512 GB | 4 TB | $20.00 | $14,400 |
*Based on 24/7 usage (720 hrs/month)
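The monthly figures are simply the hourly rate times 720 hours, which you can verify directly:

```python
# Monthly price = hourly rate x 720 hours (24/7 usage), per the footnote above
HOURS_PER_MONTH = 720

a100_hourly = {"gpu-a100-1": 2.50, "gpu-a100-2": 5.00,
               "gpu-a100-4": 10.00, "gpu-a100-8": 20.00}

for plan, rate in a100_hourly.items():
    print(f"{plan}: ${rate * HOURS_PER_MONTH:,.0f}/mo")
# gpu-a100-1: $1,800/mo ... gpu-a100-8: $14,400/mo
```

Per-second billing means partial hours are prorated, so actual bills are usually lower than these 24/7 figures.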
Specs:
- Architecture: NVIDIA Ampere
- FP32: 19.5 TFLOPS
- FP16 (Tensor Cores): 312 TFLOPS
- Memory Bandwidth: 1.6 TB/s
- NVLink: 600 GB/s (multi-GPU)
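A quick back-of-envelope from these specs: sweeping the full 40 GB of HBM once at 1.6 TB/s takes about 25 ms, a useful lower bound when estimating throughput of memory-bound kernels.

```python
# Back-of-envelope: time to read the entire 40 GB of HBM once at 1.6 TB/s
vram_gb = 40
bandwidth_gb_s = 1600  # 1.6 TB/s
sweep_ms = vram_gb / bandwidth_gb_s * 1000
print(f"Full-VRAM sweep: {sweep_ms:.0f} ms")  # 25 ms
```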
Use cases:
- Large language model training (up to 30B parameters)
- Computer vision model training
- Reinforcement learning
- Batch inference at scale
- Scientific computing
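The 30B-parameter figure follows from a common rule of thumb (an approximation, not a hard limit): FP16 weights take 2 bytes per parameter, while full training with Adam-style optimizer states and gradients needs roughly 16 bytes per parameter, before activations.

```python
# Rough VRAM math for a 30B-parameter model (rule of thumb, ignoring activations):
# ~2 bytes/param for FP16 weights, ~16 bytes/param with gradients + Adam states
params = 30e9
fp16_weights_gb = params * 2 / 1e9
training_gb = params * 16 / 1e9
print(f"FP16 weights: {fp16_weights_gb:.0f} GB")       # 60 GB
print(f"Training footprint: {training_gb:.0f} GB")     # 480 GB
```

At this scale even the 8x A100 plan (320 GB total) needs memory sharding such as DeepSpeed ZeRO to fit the full training state.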
NVIDIA H100 Plans
Flagship GPU for cutting-edge AI research.
| Plan | GPU | VRAM | vCPU | RAM | Storage | Price/hr | Price/mo* |
|---|---|---|---|---|---|---|---|
| gpu-h100-1 | 1x H100 80GB | 80 GB | 16 | 128 GB | 1 TB | $4.00 | $2,880 |
| gpu-h100-2 | 2x H100 80GB | 160 GB | 32 | 256 GB | 2 TB | $8.00 | $5,760 |
| gpu-h100-4 | 4x H100 80GB | 320 GB | 64 | 512 GB | 4 TB | $16.00 | $11,520 |
| gpu-h100-8 | 8x H100 80GB | 640 GB | 128 | 1 TB | 8 TB | $32.00 | $23,040 |
*Based on 24/7 usage (720 hrs/month)
Specs:
- Architecture: NVIDIA Hopper
- FP32: 67 TFLOPS
- FP16 (Tensor Cores): 1,979 TFLOPS
- FP8 (Transformer Engine): 3,958 TFLOPS
- Memory Bandwidth: 3.35 TB/s
- NVLink 4.0: 900 GB/s (multi-GPU)
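Comparing the headline numbers against the A100 specs above, the H100 works out to roughly 6x on FP16 Tensor Core throughput and about 2x on memory bandwidth:

```python
# Spec ratios vs. A100, taken from the two spec lists in this document
h100_fp16, a100_fp16 = 1979, 312   # TFLOPS (Tensor Cores)
h100_bw, a100_bw = 3.35, 1.6       # TB/s
print(f"FP16 speedup: {h100_fp16 / a100_fp16:.1f}x")   # ~6.3x
print(f"Bandwidth:    {h100_bw / a100_bw:.1f}x")       # ~2.1x
```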
Use cases:
- Large language models (100B+ parameters)
- Diffusion models (Stable Diffusion, DALL-E)
- Multi-modal models (CLIP, Flamingo)
- Large-scale training with DeepSpeed/Megatron
- High-throughput inference
Pricing Models
On-Demand (default):
- Hourly billing, pay per second
- No commitment
- Full flexibility
Reserved (coming Q3 2026):
- 1-year or 3-year commitment
- 30-50% discount
- Guaranteed availability
Spot (coming Q4 2026):
- Spare capacity, up to 70% discount
- Can be interrupted with 30-second warning
- Great for fault-tolerant training jobs
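To see what the three models mean in practice, here is the monthly cost of gpu-a100-1 under each. The reserved and spot numbers are illustrative, derived from the discount ranges quoted above, not published prices:

```python
# Illustrative monthly cost of gpu-a100-1 ($2.50/hr, 720 hrs) under each model.
# Reserved (30-50% off) and spot (up to 70% off) use the ranges quoted above.
on_demand = 2.50 * 720
reserved_low_disc = on_demand * 0.70    # 30% discount
reserved_high_disc = on_demand * 0.50   # 50% discount
spot_floor = on_demand * 0.30           # 70% discount
print(f"On-demand: ${on_demand:,.0f}")                                  # $1,800
print(f"Reserved:  ${reserved_high_disc:,.0f}-${reserved_low_disc:,.0f}")  # $900-$1,260
print(f"Spot:      from ${spot_floor:,.0f}")                            # from $540
```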
Creating a GPU Instance
Via Console
- Navigate to AI & ML → GPU Instances
- Click Create GPU Instance
- Select:
- GPU Type: A100 or H100
- Configuration: Number of GPUs
- Region: lagos-1, london-1, frankfurt-1
- Image: PyTorch, TensorFlow, or blank Ubuntu
- SSH Keys: For access
- Click Create
Provisioning time: 2-5 minutes (longer than CPU instances).
Via CLI
# Single A100 with PyTorch
fugoku create instance \
--name train-job-1 \
--plan gpu-a100-1 \
--image pytorch-2.0-cuda12 \
--region lagos-1 \
--ssh-key laptop
# 8x H100 for large-scale training
fugoku create instance \
--name llm-training \
--plan gpu-h100-8 \
--image ubuntu-22.04-cuda12 \
--region frankfurt-1 \
--ssh-key laptop
Via API
curl -X POST https://api.fugoku.com/v1/instances \
-H "Authorization: Bearer $FUGOKU_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "name": "train-job-1",
  "plan": "gpu-a100-1",
  "image": "pytorch-2.0-cuda12",
  "region": "lagos-1",
  "ssh_keys": ["laptop"]
}'
Pre-Configured Images
PyTorch
- Image: pytorch-2.0-cuda12
- Includes:
- PyTorch 2.0.1 with CUDA 12.1
- torchvision, torchaudio
- Transformers (Hugging Face)
- NVIDIA Apex
- Jupyter Lab
- Common ML libraries (numpy, pandas, scikit-learn)
# Launch and verify
fugoku ssh train-job-1
python3 -c "import torch; print(torch.cuda.is_available())"
# True
python3 -c "import torch; print(torch.cuda.get_device_name(0))"
# NVIDIA A100-SXM4-40GB
TensorFlow
- Image: tensorflow-2.13-cuda12
- Includes:
- TensorFlow 2.13 with CUDA 12.1
- Keras
- TensorBoard
- Jupyter Lab
- Common ML libraries
# Verify
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
JAX
- Image: jax-0.4-cuda12
- Includes:
- JAX 0.4.13
- Flax
- Optax
- Jupyter Lab
# Verify
python3 -c "import jax; print(jax.devices())"
# [GpuDevice(id=0)]
Blank CUDA
- Image: ubuntu-22.04-cuda12
- Includes:
- Ubuntu 22.04
- CUDA 12.1
- cuDNN 8.9
- NVIDIA drivers
- Build tools (gcc, make, cmake)
Perfect for custom environments or frameworks not pre-packaged.
Example Workflows
Training a Model with PyTorch
# Create GPU instance
fugoku create instance \
--name pytorch-train \
--plan gpu-a100-1 \
--image pytorch-2.0-cuda12 \
--region lagos-1
# SSH into instance
fugoku ssh pytorch-train
# Clone your training repo
git clone https://github.com/yourname/model-training.git
cd model-training
# Install additional dependencies
pip install -r requirements.txt
# Run training
python train.py \
--batch-size 64 \
--epochs 100 \
--lr 0.001 \
--gpu 0
# Monitor GPU usage
nvidia-smi -l 1
Distributed Training (Multi-GPU)
Using PyTorch DistributedDataParallel:
# train.py
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # Tell each worker where to rendezvous before initializing NCCL
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def train(rank, world_size):
    setup(rank, world_size)
    # YourModel, num_epochs, dataloader, criterion, optimizer are
    # placeholders for your own training code
    model = YourModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    # Training loop
    for epoch in range(num_epochs):
        for batch, targets in dataloader:
            optimizer.zero_grad()
            outputs = ddp_model(batch)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4  # 4 GPUs
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
Run on a 4-GPU instance:
python train.py
Fine-Tuning with Hugging Face
# SSH into instance
fugoku ssh pytorch-train
# Install transformers
pip install transformers datasets accelerate
# Fine-tune BERT (run_glue.py is from the Hugging Face transformers examples)
python -m torch.distributed.launch \
--nproc_per_node=1 \
run_glue.py \
--model_name_or_path bert-base-uncased \
--task_name mnli \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3 \
--output_dir /tmp/mnli_output
Inference Server
Deploy trained model for inference:
# inference.py
import torch
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load model
model = torch.load('model.pth')
model.eval()
model = model.cuda()

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json['data']
    tensor = torch.tensor(data).cuda()
    with torch.no_grad():
        output = model(tensor)
    return jsonify({'prediction': output.cpu().numpy().tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
Run server:
python inference.py
Test from client:
curl -X POST http://gpu-instance-ip:5000/predict \
-H "Content-Type: application/json" \
-d '{"data": [1.0, 2.0, 3.0]}'
Monitoring GPU Usage
nvidia-smi
Real-time GPU monitoring:
# One-time snapshot
nvidia-smi
# Continuous monitoring (1 second refresh)
nvidia-smi -l 1
# Compact format
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv -l 1
Output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:00:04.0 Off | 0 |
| N/A 42C P0 120W / 400W | 32768MiB / 40960MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
Fugoku Console Metrics
Navigate to instance detail page → GPU Metrics tab:
- GPU utilization (%)
- Memory utilization (%)
- Temperature (°C)
- Power usage (W)
- Graphs: 1h, 6h, 24h, 7d
Programmatic Monitoring
import subprocess
import json

def get_gpu_stats():
    result = subprocess.run([
        'nvidia-smi',
        '--query-gpu=index,utilization.gpu,memory.used,memory.total,temperature.gpu',
        '--format=csv,noheader,nounits'
    ], capture_output=True, text=True)
    lines = result.stdout.strip().split('\n')
    stats = []
    for line in lines:
        idx, util, mem_used, mem_total, temp = line.split(', ')
        stats.append({
            'gpu': int(idx),
            'utilization': float(util),
            'memory_used': int(mem_used),
            'memory_total': int(mem_total),
            'temperature': float(temp)
        })
    return stats

# Usage
print(json.dumps(get_gpu_stats(), indent=2))
Storage for ML Workflows
Root Disk
GPU instances include large root disks (500GB - 8TB).
For datasets larger than root disk, attach block volumes:
# Create 2 TB volume
fugoku volumes create \
--name datasets \
--size 2000 \
--region lagos-1
# Attach to GPU instance
fugoku volumes attach datasets --instance pytorch-train
# Format and mount
# SSH into the instance, then:
sudo mkfs.ext4 /dev/vdb
sudo mkdir /data
sudo mount /dev/vdb /data
echo '/dev/vdb /data ext4 defaults 0 2' | sudo tee -a /etc/fstab
Object Storage (coming Q3 2026)
S3-compatible object storage for datasets and checkpoints:
import boto3

s3 = boto3.client(
    's3',
    endpoint_url='https://s3.fugoku.com',
    aws_access_key_id='your_key',
    aws_secret_access_key='your_secret'
)

# Upload checkpoints during training
s3.upload_file('checkpoint_epoch_10.pth', 'my-bucket', 'checkpoints/checkpoint_10.pth')

# Download datasets at training start
s3.download_file('my-bucket', 'datasets/imagenet.tar', 'imagenet.tar')
Jupyter Lab
PyTorch and TensorFlow images include Jupyter Lab pre-installed.
Start Jupyter:
# SSH into instance
fugoku ssh pytorch-train
# Start Jupyter (accessible from localhost:8888)
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
# Output:
# [I 2024-02-25 10:30:00.123 ServerApp] Jupyter Server is running at:
# [I 2024-02-25 10:30:00.123 ServerApp] http://0.0.0.0:8888/lab?token=abc123...
Access from your laptop:
# SSH tunnel
ssh -L 8888:localhost:8888 ubuntu@gpu-instance-ip
# Open in browser
open http://localhost:8888/lab?token=abc123...
Or set up password:
jupyter lab password
# Enter password twice
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
Best Practices
Cost Optimization
Stop instances when not training: GPU instances are expensive - stop when idle.
# Stop instance (keeps all data)
fugoku instances stop pytorch-train
# Resume later
fugoku instances start pytorch-train
Use spot instances (when available): Up to 70% cheaper for fault-tolerant jobs.
Checkpoint frequently: Save model checkpoints every N epochs - resume if spot instance interrupted.
# Save checkpoint every 10 epochs
if epoch % 10 == 0:
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, f'checkpoint_epoch_{epoch}.pth')
Performance Optimization
Use mixed precision training: FP16 reduces memory usage and increases throughput.
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for batch in dataloader:
    optimizer.zero_grad()
    with autocast():
        outputs = model(batch)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
Pin memory in DataLoader:
dataloader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,
    pin_memory=True  # Faster CPU->GPU transfer
)
Use gradient accumulation: Simulate larger batch sizes without OOM:
accumulation_steps = 4
optimizer.zero_grad()
for i, batch in enumerate(dataloader):
    outputs = model(batch)
    loss = criterion(outputs, targets)
    loss = loss / accumulation_steps  # scale so accumulated gradients average out
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
Multi-GPU Training
Data parallelism: Replicate model across GPUs, split data.
model = nn.DataParallel(model)
Distributed data parallelism (recommended): More efficient, use DistributedDataParallel.
Model parallelism: Split large models across GPUs (for models that don't fit in single GPU memory).
See PyTorch docs for details.
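The core idea behind data parallelism can be sketched without any GPU at all: each of N workers receives 1/N of the global batch, and gradients are averaged across workers after the backward pass. A plain-Python illustration of the sharding arithmetic (shard_batch is a hypothetical helper, not a PyTorch or Fugoku API):

```python
# Sketch: split one global batch across world_size workers, conceptually
# what DistributedSampler does for DDP (hypothetical helper for illustration)
def shard_batch(batch, world_size):
    per_worker = len(batch) // world_size
    return [batch[r * per_worker:(r + 1) * per_worker] for r in range(world_size)]

global_batch = list(range(64))
shards = shard_batch(global_batch, 4)
print([len(s) for s in shards])  # [16, 16, 16, 16]
```

In real DDP the sharding happens in the DataLoader via DistributedSampler, and the gradient averaging is handled by NCCL all-reduce during backward.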
Troubleshooting
CUDA out of memory:
- Reduce batch size
- Enable gradient checkpointing
- Use mixed precision
- Clear cache:
torch.cuda.empty_cache()
Slow training:
- Check GPU utilization with nvidia-smi - should be 90-100%
- Profile with PyTorch profiler
- Check data loading bottleneck - use more workers
- Verify disk I/O isn't limiting
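One quick way to sanity-check disk throughput is a dd write test (a generic Linux check, not Fugoku-specific; the path and sizes here are placeholders):

```shell
# Rough sequential-write speed check; dd reports MB/s when it finishes.
# (Add oflag=direct to bypass the page cache on filesystems that support O_DIRECT.)
dd if=/dev/zero of=./ddtest.bin bs=1M count=256
rm -f ./ddtest.bin
```

If the reported rate is far below what NVMe should deliver, data loading rather than the GPU may be your bottleneck.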
Driver issues: Drivers are pre-installed and tested. If you still see problems:
# Check driver
nvidia-smi
# Reinstall if needed (contact support first)
sudo apt install --reinstall nvidia-driver-525
Getting Help
GPU instance issues? We're here:
- Email: support@fugoku.com (include instance ID)
- Discord: #gpu-compute channel
- Documentation: docs.fugoku.com/gpu
Common support requests:
- Driver/CUDA version conflicts
- Multi-GPU configuration
- Performance optimization
- Custom image requests
Next Steps:
- Explore ML Environments for pre-built stacks
- Learn about Model Deployment for serving
- Read API Documentation for automation
- Browse Instance Types for CPU compute