Fugoku Docs

GPU Compute

NVIDIA H100 and A100 GPUs for AI/ML training, inference, and high-performance computing

Fugoku offers bare-metal NVIDIA H100 and A100 GPUs for AI/ML training, inference, rendering, and compute-intensive workloads.

GPU Plans

All GPU instances include:

  • Bare-metal GPU access (no virtualization overhead)
  • CUDA 12.x pre-installed
  • cuDNN libraries
  • NVIDIA drivers (latest stable)
  • High-bandwidth CPU-GPU interconnect
  • NVMe SSD storage
  • 10 Gbps network

NVIDIA A100 Plans

Professional GPU for training and inference.

| Plan | GPU | VRAM | vCPU | RAM | Storage | Price/hr | Price/mo* |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpu-a100-1 | 1x A100 40GB | 40 GB | 12 | 64 GB | 500 GB | $2.50 | $1,800 |
| gpu-a100-2 | 2x A100 40GB | 80 GB | 24 | 128 GB | 1 TB | $5.00 | $3,600 |
| gpu-a100-4 | 4x A100 40GB | 160 GB | 48 | 256 GB | 2 TB | $10.00 | $7,200 |
| gpu-a100-8 | 8x A100 40GB | 320 GB | 96 | 512 GB | 4 TB | $20.00 | $14,400 |

*Based on 24/7 usage (720 hrs/month)
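
The monthly figure is just the hourly rate times 720 hours (24/7 for a 30-day month). A one-liner to sanity-check a plan's cost:

```python
def monthly_cost(hourly_rate: float, hours_per_month: int = 720) -> float:
    """Monthly cost at continuous (24/7) usage."""
    return hourly_rate * hours_per_month

# gpu-a100-1 at $2.50/hr:
print(monthly_cost(2.50))  # 1800.0
```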

Specs:

  • Architecture: NVIDIA Ampere
  • FP32: 19.5 TFLOPS
  • FP16 (Tensor Cores): 312 TFLOPS
  • Memory Bandwidth: 1.6 TB/s
  • NVLink: 600 GB/s (multi-GPU)

Use cases:

  • Large language model training (up to 30B parameters)
  • Computer vision model training
  • Reinforcement learning
  • Batch inference at scale
  • Scientific computing

NVIDIA H100 Plans

Flagship GPU for cutting-edge AI research.

| Plan | GPU | VRAM | vCPU | RAM | Storage | Price/hr | Price/mo* |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpu-h100-1 | 1x H100 80GB | 80 GB | 16 | 128 GB | 1 TB | $4.00 | $2,880 |
| gpu-h100-2 | 2x H100 80GB | 160 GB | 32 | 256 GB | 2 TB | $8.00 | $5,760 |
| gpu-h100-4 | 4x H100 80GB | 320 GB | 64 | 512 GB | 4 TB | $16.00 | $11,520 |
| gpu-h100-8 | 8x H100 80GB | 640 GB | 128 | 1 TB | 8 TB | $32.00 | $23,040 |

Specs:

  • Architecture: NVIDIA Hopper
  • FP32: 67 TFLOPS
  • FP16 (Tensor Cores): 1,979 TFLOPS
  • FP8 (Transformer Engine): 3,958 TFLOPS
  • Memory Bandwidth: 3.35 TB/s
  • NVLink 4.0: 900 GB/s (multi-GPU)

Use cases:

  • Large language models (100B+ parameters)
  • Diffusion models (Stable Diffusion, DALL-E)
  • Multi-modal models (CLIP, Flamingo)
  • Large-scale training with DeepSpeed/Megatron
  • High-throughput inference

Pricing Models

On-Demand (default):

  • Billed per second, priced by the hour
  • No commitment
  • Full flexibility

Reserved (coming Q3 2026):

  • 1-year or 3-year commitment
  • 30-50% discount
  • Guaranteed availability

Spot (coming Q4 2026):

  • Spare capacity, up to 70% discount
  • Can be interrupted with 30-second warning
  • Great for fault-tolerant training jobs

Creating a GPU Instance

Via Console

  1. Navigate to AI & ML → GPU Instances
  2. Click Create GPU Instance
  3. Select:
    • GPU Type: A100 or H100
    • Configuration: Number of GPUs
    • Region: lagos-1, london-1, frankfurt-1
    • Image: PyTorch, TensorFlow, or blank Ubuntu
    • SSH Keys: For access
  4. Click Create

Provisioning time: 2-5 minutes (longer than CPU instances).

Via CLI

# Single A100 with PyTorch
fugoku create instance \
  --name train-job-1 \
  --plan gpu-a100-1 \
  --image pytorch-2.0-cuda12 \
  --region lagos-1 \
  --ssh-key laptop

# 8x H100 for large-scale training
fugoku create instance \
  --name llm-training \
  --plan gpu-h100-8 \
  --image ubuntu-22.04-cuda12 \
  --region frankfurt-1 \
  --ssh-key laptop

Via API

curl -X POST https://api.fugoku.com/v1/instances \
  -H "Authorization: Bearer $FUGOKU_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "train-job-1",
    "plan": "gpu-a100-1",
    "image": "pytorch-2.0-cuda12",
    "region": "lagos-1",
    "ssh_keys": ["laptop"]
  }'
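
The same call from Python, using only the standard library (mirrors the curl request above; uncomment the last lines to actually send it):

```python
import json
import os
import urllib.request

payload = {
    "name": "train-job-1",
    "plan": "gpu-a100-1",
    "image": "pytorch-2.0-cuda12",
    "region": "lagos-1",
    "ssh_keys": ["laptop"],
}

req = urllib.request.Request(
    "https://api.fugoku.com/v1/instances",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {os.environ.get('FUGOKU_API_TOKEN', '')}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```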

Pre-Configured Images

PyTorch

  • Image: pytorch-2.0-cuda12
  • Includes:
    • PyTorch 2.0.1 with CUDA 12.1
    • torchvision, torchaudio
    • Transformers (Hugging Face)
    • NVIDIA Apex
    • Jupyter Lab
    • Common ML libraries (numpy, pandas, scikit-learn)
# Launch and verify
fugoku ssh train-job-1
python3 -c "import torch; print(torch.cuda.is_available())"
# True
python3 -c "import torch; print(torch.cuda.get_device_name(0))"
# NVIDIA A100-SXM4-40GB

TensorFlow

  • Image: tensorflow-2.13-cuda12
  • Includes:
    • TensorFlow 2.13 with CUDA 12.1
    • Keras
    • TensorBoard
    • Jupyter Lab
    • Common ML libraries
# Verify
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

JAX

  • Image: jax-0.4-cuda12
  • Includes:
    • JAX 0.4.13
    • Flax
    • Optax
    • Jupyter Lab
# Verify
python3 -c "import jax; print(jax.devices())"
# [GpuDevice(id=0)]

Blank CUDA

  • Image: ubuntu-22.04-cuda12
  • Includes:
    • Ubuntu 22.04
    • CUDA 12.1
    • cuDNN 8.9
    • NVIDIA drivers
    • Build tools (gcc, make, cmake)

Perfect for custom environments or frameworks not pre-packaged.

Example Workflows

Training a Model with PyTorch

# Create GPU instance
fugoku create instance \
  --name pytorch-train \
  --plan gpu-a100-1 \
  --image pytorch-2.0-cuda12 \
  --region lagos-1

# SSH into instance
fugoku ssh pytorch-train

# Clone your training repo
git clone https://github.com/yourname/model-training.git
cd model-training

# Install additional dependencies
pip install -r requirements.txt

# Run training
python train.py \
  --batch-size 64 \
  --epochs 100 \
  --lr 0.001 \
  --gpu 0

# Monitor GPU usage
nvidia-smi -l 1

Distributed Training (Multi-GPU)

Using PyTorch DistributedDataParallel:

# train.py
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def train(rank, world_size):
    setup(rank, world_size)
    
    model = YourModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    
    # Training loop (dataloader should use a DistributedSampler)
    for epoch in range(num_epochs):
        for inputs, targets in dataloader:
            inputs, targets = inputs.to(rank), targets.to(rank)
            optimizer.zero_grad()
            outputs = ddp_model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
    
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4  # 4 GPUs
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)

Run on a 4-GPU instance:

python train.py

Fine-Tuning with Hugging Face

# SSH into instance
fugoku ssh pytorch-train

# Install transformers
pip install transformers datasets accelerate

# Fine-tune BERT
torchrun \
  --nproc_per_node=1 \
  run_glue.py \
  --model_name_or_path bert-base-uncased \
  --task_name mnli \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir /tmp/mnli_output

Inference Server

Deploy trained model for inference:

# inference.py
import torch
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load model
model = torch.load('model.pth')
model.eval()
model = model.cuda()

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json['data']
    tensor = torch.tensor(data).cuda()
    
    with torch.no_grad():
        output = model(tensor)
    
    return jsonify({'prediction': output.cpu().numpy().tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Run server:

python inference.py

Test from client:

curl -X POST http://gpu-instance-ip:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"data": [1.0, 2.0, 3.0]}'
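
Or the same request from Python with the standard library (`gpu-instance-ip` is a placeholder for your instance's address):

```python
import json
import urllib.request

def build_predict_request(host: str, data: list) -> urllib.request.Request:
    """Build the POST /predict request that the curl example sends."""
    return urllib.request.Request(
        f"http://{host}:5000/predict",
        data=json.dumps({"data": data}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# with urllib.request.urlopen(build_predict_request("gpu-instance-ip", [1.0, 2.0, 3.0])) as resp:
#     print(json.load(resp)["prediction"])
```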

Monitoring GPU Usage

nvidia-smi

Real-time GPU monitoring:

# One-time snapshot
nvidia-smi

# Continuous monitoring (1 second refresh)
nvidia-smi -l 1

# Compact format
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv -l 1

Output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB   On   | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0   120W / 400W |  32768MiB / 40960MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

Fugoku Console Metrics

Navigate to instance detail page → GPU Metrics tab:

  • GPU utilization (%)
  • Memory utilization (%)
  • Temperature (°C)
  • Power usage (W)
  • Graphs: 1h, 6h, 24h, 7d

Programmatic Monitoring

import subprocess
import json

def get_gpu_stats():
    result = subprocess.run([
        'nvidia-smi',
        '--query-gpu=index,utilization.gpu,memory.used,memory.total,temperature.gpu',
        '--format=csv,noheader,nounits'
    ], capture_output=True, text=True)
    
    lines = result.stdout.strip().split('\n')
    stats = []
    for line in lines:
        idx, util, mem_used, mem_total, temp = line.split(', ')
        stats.append({
            'gpu': int(idx),
            'utilization': float(util),
            'memory_used': int(mem_used),
            'memory_total': int(mem_total),
            'temperature': float(temp)
        })
    return stats

# Usage
print(json.dumps(get_gpu_stats(), indent=2))

Storage for ML Workflows

Root Disk

GPU instances include large root disks (500 GB to 8 TB).

For datasets larger than the root disk, attach block volumes:

# Create 2 TB volume
fugoku volumes create \
  --name datasets \
  --size 2000 \
  --region lagos-1

# Attach to GPU instance
fugoku volumes attach datasets --instance pytorch-train

# On the instance, format and mount:
sudo mkfs.ext4 /dev/vdb
sudo mkdir /data
sudo mount /dev/vdb /data
echo '/dev/vdb /data ext4 defaults 0 2' | sudo tee -a /etc/fstab

Object Storage (coming Q3 2026)

S3-compatible object storage for datasets and checkpoints:

import boto3

s3 = boto3.client('s3',
    endpoint_url='https://s3.fugoku.com',
    aws_access_key_id='your_key',
    aws_secret_access_key='your_secret'
)

# Upload checkpoints during training
s3.upload_file('checkpoint_epoch_10.pth', 'my-bucket', 'checkpoints/checkpoint_10.pth')

# Download datasets at training start
s3.download_file('my-bucket', 'datasets/imagenet.tar', 'imagenet.tar')

Jupyter Lab

PyTorch and TensorFlow images include Jupyter Lab pre-installed.

Start Jupyter:

# SSH into instance
fugoku ssh pytorch-train

# Start Jupyter (listens on port 8888 on all interfaces)
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root

# Output:
# [I 2024-02-25 10:30:00.123 ServerApp] Jupyter Server is running at:
# [I 2024-02-25 10:30:00.123 ServerApp] http://0.0.0.0:8888/lab?token=abc123...

Access from your laptop:

# SSH tunnel
ssh -L 8888:localhost:8888 ubuntu@gpu-instance-ip

# Open in browser
open http://localhost:8888/lab?token=abc123...

Or set a password:

jupyter lab password
# Enter password twice

jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root

Best Practices

Cost Optimization

Stop instances when not training: GPU instances are expensive - stop when idle.

# Stop instance (keeps all data)
fugoku instances stop pytorch-train

# Resume later
fugoku instances start pytorch-train

Use spot instances (when available): Up to 70% cheaper for fault-tolerant jobs.

Checkpoint frequently: Save model checkpoints every N epochs so you can resume if a spot instance is interrupted.

# Save checkpoint every 10 epochs
if epoch % 10 == 0:
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, f'checkpoint_epoch_{epoch}.pth')
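
To resume after an interruption, load the checkpoint back into the same objects. A minimal runnable sketch with a stand-in model; substitute your own model and optimizer:

```python
import torch
import torch.nn as nn

# Stand-ins so the sketch runs end to end.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

# Saved as in the snippet above ...
torch.save({
    "epoch": 90,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}, "checkpoint_epoch_90.pth")

# ... then restored on the new instance:
checkpoint = torch.load("checkpoint_epoch_90.pth")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1  # training continues from here
```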

Performance Optimization

Use mixed precision training: FP16 reduces memory usage and increases throughput.

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    
    with autocast():
        outputs = model(batch)
        loss = criterion(outputs, targets)
    
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

Pin memory in DataLoader:

dataloader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,
    pin_memory=True  # Faster CPU->GPU transfer
)

Use gradient accumulation: Simulate larger batch sizes without OOM:

accumulation_steps = 4

for i, batch in enumerate(dataloader):
    outputs = model(batch)
    loss = criterion(outputs, targets)
    loss = loss / accumulation_steps
    loss.backward()
    
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Multi-GPU Training

Data parallelism: Replicate model across GPUs, split data.

model = nn.DataParallel(model)

Distributed data parallelism (recommended): More efficient, use DistributedDataParallel.

Model parallelism: Split large models across GPUs (for models that don't fit in single GPU memory).

See PyTorch docs for details.
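
As an illustration of the idea, a naive two-way split (a sketch, not PyTorch's pipeline API; devices are parameters so the same class also runs on CPU):

```python
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    """Naive model parallelism: layers live on different devices and
    activations are moved between them in forward()."""
    def __init__(self, dev0="cuda:0", dev1="cuda:1"):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.part1 = nn.Linear(1024, 4096).to(dev0)
        self.part2 = nn.Linear(4096, 10).to(dev1)

    def forward(self, x):
        x = torch.relu(self.part1(x.to(self.dev0)))
        return self.part2(x.to(self.dev1))  # hop to the second device
```

Each GPU now holds only its half of the parameters, at the cost of idle time while activations cross devices; pipeline parallelism mitigates that.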

Troubleshooting

CUDA out of memory:

  • Reduce batch size
  • Enable gradient checkpointing
  • Use mixed precision
  • Clear cache: torch.cuda.empty_cache()
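
Gradient checkpointing trades compute for memory: activations inside checkpointed segments are recomputed during the backward pass instead of being stored. A sketch using `torch.utils.checkpoint` on a toy model:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)

x = torch.randn(32, 512, requires_grad=True)
# Split into 2 segments; only segment boundaries keep activations.
out = checkpoint_sequential(model, 2, x)
out.sum().backward()
```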

Slow training:

  • Check GPU utilization with nvidia-smi - it should be 90-100% during training
  • Profile with the PyTorch profiler
  • Check for a data-loading bottleneck - increase num_workers in the DataLoader
  • Verify disk I/O isn't the limiting factor
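
A quick way to see where time goes with the PyTorch profiler (sketch on a toy model; the CUDA activity is only added when a GPU is present):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024)
x = torch.randn(64, 1024)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    model, x = model.cuda(), x.cuda()

with profile(activities=activities) as prof:
    for _ in range(10):
        model(x)

# Top ops by time; look for copies or data loading dominating GPU compute.
sort_key = "cuda_time_total" if torch.cuda.is_available() else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))
```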

Driver issues: Drivers come pre-installed and tested. If you still hit problems:

# Check driver
nvidia-smi

# Reinstall if needed (contact support first)
sudo apt install --reinstall nvidia-driver-525

Getting Help

GPU instance issues? We're here to help.

Common support requests:

  • Driver/CUDA version conflicts
  • Multi-GPU configuration
  • Performance optimization
  • Custom image requests
