ML Environments
Pre-configured machine learning stacks with PyTorch, TensorFlow, Jupyter, and more
Fugoku provides pre-configured machine learning environments for rapid AI/ML development and training.
Pre-Built Images
All ML images include:
- Latest framework versions
- CUDA 12.x + cuDNN
- Jupyter Lab with extensions
- Common ML libraries (numpy, pandas, scikit-learn, matplotlib)
- GPU-optimized builds
- Pre-warmed caches for faster startup
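On a fresh instance you can confirm which pieces of the stack import cleanly. A minimal sketch (the `available` helper and module list are illustrative, not part of any image):

```python
import importlib.util

def available(modules):
    """Map each module name to whether it can be imported."""
    return {m: importlib.util.find_spec(m) is not None for m in modules}

# Adjust the list to the image you launched
status = available(["numpy", "pandas", "sklearn", "matplotlib"])
for name, ok in status.items():
    print(f"{name}: {'OK' if ok else 'MISSING'}")
```

`find_spec` checks importability without actually importing, so it is safe to run against heavy frameworks.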
PyTorch
Image: pytorch-2.0-cuda12
Includes:
- PyTorch 2.0.1 with CUDA 12.1
- torchvision 0.15.2
- torchaudio 2.0.2
- Transformers (Hugging Face) 4.30.0
- Accelerate 0.20.0
- NVIDIA Apex (mixed precision)
- Flash Attention 2
- bitsandbytes (quantization)
- Jupyter Lab 4.0
- tensorboard
- wandb (Weights & Biases)
Quick start:
fugoku create instance \
--name pytorch-dev \
--plan gpu-a100-1 \
--image pytorch-2.0-cuda12 \
--region lagos-1
fugoku ssh pytorch-dev
python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
# CUDA available: True

TensorFlow
Image: tensorflow-2.13-cuda12
Includes:
- TensorFlow 2.13 with CUDA 12.1
- Keras 2.13
- TensorBoard 2.13
- TensorFlow Datasets
- TF-Agents (RL)
- TensorFlow Probability
- Jupyter Lab 4.0
- wandb
Quick start:
fugoku create instance \
--name tf-dev \
--plan gpu-a100-1 \
--image tensorflow-2.13-cuda12 \
--region lagos-1
fugoku ssh tf-dev
python3 -c "import tensorflow as tf; print(f'GPUs: {len(tf.config.list_physical_devices(\"GPU\"))}')"
# GPUs: 1

JAX
Image: jax-0.4-cuda12
Includes:
- JAX 0.4.13 with CUDA 12.1
- Flax (neural networks)
- Optax (optimization)
- Chex (testing utilities)
- Orbax (checkpointing)
- Jupyter Lab 4.0
Quick start:
fugoku create instance \
--name jax-dev \
--plan gpu-a100-1 \
--image jax-0.4-cuda12 \
--region lagos-1
fugoku ssh jax-dev
python3 -c "import jax; print(f'Devices: {jax.devices()}')"
# Devices: [GpuDevice(id=0)]

Multi-Framework
Image: ml-all-cuda12
Includes:
- PyTorch 2.0.1
- TensorFlow 2.13
- JAX 0.4.13
- Jupyter Lab 4.0
- All libraries from above
Use case: Experimentation across frameworks, model conversion.
Size: ~15 GB (vs ~5 GB per single-framework image)
Jupyter Lab
All ML images include Jupyter Lab pre-installed.
Starting Jupyter
# SSH into instance
fugoku ssh pytorch-dev
# Start Jupyter
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
# Output includes token URL:
# http://0.0.0.0:8888/lab?token=abc123...

Access from Local Machine
SSH tunnel:
# On your laptop
ssh -L 8888:localhost:8888 ubuntu@<instance-ip>
# Open in browser
open http://localhost:8888/lab?token=abc123...

Or set a password:
# On instance
jupyter lab password
# Enter password (twice)
# Start Jupyter (no token needed)
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
# Access via: http://<instance-ip>:8888

Security note: if you expose Jupyter publicly, use HTTPS and a strong password. Better yet, keep it private behind an SSH tunnel (recommended).
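A strong password can be generated with Python's standard library before running `jupyter lab password`:

```python
import secrets

# 32 hex characters, roughly 128 bits of entropy
password = secrets.token_hex(16)
print(password)
```

`secrets` draws from the OS's CSPRNG, unlike `random`, which is not suitable for credentials.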
Jupyter Extensions
Pre-installed extensions:
- jupyterlab-git (Git integration)
- jupyterlab-lsp (code intelligence)
- jupyterlab-execute-time
- nbdime (notebook diff)
- jupyterlab-tensorboard
TensorBoard in Jupyter
# In Jupyter notebook
%load_ext tensorboard
%tensorboard --logdir ./logs

Example Workflows
Fine-Tune BERT (PyTorch + Transformers)
# Create instance
fugoku create instance \
--name bert-finetune \
--plan gpu-a100-1 \
--image pytorch-2.0-cuda12 \
--region lagos-1 \
--wait
# SSH in
fugoku ssh bert-finetune
# Create training script
cat > train.py << 'EOF'
from transformers import AutoTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

# Load and tokenize the dataset (Trainer expects input_ids, not raw text)
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

# Load model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# Training args
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    fp16=True,  # Mixed precision
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
# Train
trainer.train()
EOF
# Run training
python train.py

Image Classification (TensorFlow)
# Create instance
fugoku create instance \
--name image-classifier \
--plan gpu-a100-1 \
--image tensorflow-2.13-cuda12 \
--region lagos-1 \
--wait
fugoku ssh image-classifier
# Training script
cat > train.py << 'EOF'
import tensorflow as tf
from tensorflow.keras import layers, models
# Load CIFAR-10
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
# Build model
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10)
])

model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)

# Train
history = model.fit(
    x_train, y_train,
    epochs=10,
    validation_data=(x_test, y_test),
    batch_size=64
)
# Save model
model.save('cifar10_model.h5')
EOF
python train.py

Distributed Training (PyTorch DDP)
# Create 4-GPU instance
fugoku create instance \
--name distributed-train \
--plan gpu-a100-4 \
--image pytorch-2.0-cuda12 \
--region lagos-1 \
--wait
fugoku ssh distributed-train
# Distributed training script
cat > train_ddp.py << 'EOF'
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # mp.spawn does not set these; init_process_group needs them
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)

    # Model -- replace YourModel, dataloader, criterion, optimizer,
    # and num_epochs with your own definitions
    model = YourModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # Training loop
    for epoch in range(num_epochs):
        for batch, targets in dataloader:
            batch, targets = batch.to(rank), targets.to(rank)
            outputs = ddp_model(batch)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    cleanup()

if __name__ == "__main__":
    world_size = 4  # 4 GPUs
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
EOF
# Run on all 4 GPUs
python train_ddp.py

Data Management
Datasets
Hugging Face Datasets:
from datasets import load_dataset
# Auto-cached to ~/.cache/huggingface
dataset = load_dataset("squad")

TensorFlow Datasets:
import tensorflow_datasets as tfds
# Auto-cached to ~/tensorflow_datasets
dataset = tfds.load("imagenet2012", split="train")

Custom datasets: Store on attached block volumes:
# Create 2TB volume for datasets
fugoku volumes create \
--name datasets \
--size 2000 \
--region lagos-1
# Attach to instance
fugoku volumes attach datasets --instance pytorch-dev
# Mount
sudo mkfs.ext4 /dev/vdb
sudo mkdir /data
sudo mount /dev/vdb /data
sudo chown ubuntu:ubuntu /data
# Symlink common cache paths to the volume
mkdir -p /data/huggingface /data/tensorflow_datasets
rm -rf ~/.cache/huggingface ~/tensorflow_datasets
ln -s /data/huggingface ~/.cache/huggingface
ln -s /data/tensorflow_datasets ~/tensorflow_datasets

Model Checkpoints
Save to volume:
# PyTorch
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, '/data/checkpoints/model_epoch_{}.pth'.format(epoch))
# TensorFlow
model.save('/data/checkpoints/model_epoch_{}'.format(epoch))

Auto-checkpoint with Transformers:
training_args = TrainingArguments(
    output_dir="/data/checkpoints",
    save_steps=1000,
    save_total_limit=3,  # Keep only the 3 latest
)

Experiment Tracking
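Both trackers below come pre-installed. When a run is offline or air-gapped, metrics can also be appended to a JSONL file with the standard library alone. A minimal sketch (`log_metrics` is a hypothetical helper, not part of any image):

```python
import json
import tempfile
import time
from pathlib import Path

def log_metrics(path, step, **metrics):
    """Append one JSON record per line (JSONL)."""
    record = {"step": step, "time": time.time(), **metrics}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Demo in a temp dir; point this at a volume path (e.g. /data/logs) in practice
log_file = Path(tempfile.mkdtemp()) / "metrics.jsonl"
for step in range(3):
    log_metrics(log_file, step=step, loss=1.0 / (step + 1))

# Read back for plotting or inspection
records = [json.loads(line) for line in log_file.read_text().splitlines()]
print(len(records), records[-1]["loss"])
```

JSONL survives crashes better than a single JSON document, since each line is a complete record.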
Weights & Biases
Pre-installed in all ML images.
import wandb
# Initialize
wandb.init(project="my-project", name="experiment-1")
# Log metrics
wandb.log({"loss": loss, "accuracy": acc})
# Log model
wandb.save("model.pth")

TensorBoard
# PyTorch
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter('runs/experiment_1')
writer.add_scalar('Loss/train', loss, epoch)
writer.add_scalar('Accuracy/train', acc, epoch)
writer.close()

# TensorFlow
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='./logs',
    histogram_freq=1
)
model.fit(x_train, y_train, callbacks=[tensorboard_callback])

View TensorBoard:
tensorboard --logdir ./logs --host 0.0.0.0 --port 6006

Access via SSH tunnel or public IP.
Environment Customization
Install Additional Packages
# Python packages
pip install transformers datasets accelerate
# System packages
sudo apt update
sudo apt install ffmpeg libsm6 libxext6 -y
# Conda packages
conda install -c conda-forge lightgbm

Create Custom Image
Save your configured environment:
# On configured instance
fugoku snapshots create pytorch-dev --name my-custom-ml-env
# Create new instance from snapshot
fugoku create instance \
--name new-dev \
--plan gpu-a100-1 \
--region lagos-1 \
--snapshot my-custom-ml-env

Docker Containers
Run ML frameworks in Docker:
# Install Docker
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker ubuntu
# NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update && sudo apt install -y nvidia-docker2
sudo systemctl restart docker
# Run PyTorch container
docker run --gpus all -it --rm pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime

Best Practices
Cost Optimization
Stop instances when not training:
# Stop (keeps data, stops billing for compute)
fugoku instances stop pytorch-dev
# Resume later
fugoku instances start pytorch-dev

Use smaller instances for development:
# Dev work on CPU instance
fugoku create instance --name jupyter-dev --plan standard-2 --image pytorch-2.0-cuda12
# Train on GPU
fugoku create instance --name training --plan gpu-a100-1 --image pytorch-2.0-cuda12

Spot instances (coming Q4 2026): Up to 70% cheaper for fault-tolerant training.
Data Management
- Store datasets on volumes - Separate from instance lifecycle
- Cache downloads - Use ~/.cache symlinked to volume
- Checkpoint frequently - Every N steps, not just epochs
- Use compression - For large checkpoint files
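The compression point can be handled with the standard library. A minimal sketch using gzip (`compress_checkpoint` is a hypothetical helper; paths are illustrative):

```python
import gzip
import os
import shutil
import tempfile

def compress_checkpoint(src, dst=None):
    """Gzip a checkpoint file and return the compressed path."""
    dst = dst or src + ".gz"
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst

# Demo with a throwaway file; in practice pass /data/checkpoints/model_epoch_3.pth
src = os.path.join(tempfile.mkdtemp(), "model.pth")
with open(src, "wb") as f:
    f.write(b"\x00" * 1_000_000)  # highly compressible dummy weights

dst = compress_checkpoint(src)
print(os.path.getsize(src), "->", os.path.getsize(dst))
```

Real model weights compress far less than the dummy file above, but gzip still typically saves meaningful space on FP32 checkpoints.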
Security
- Don't expose Jupyter publicly - Use SSH tunnels or VPN
- Set Jupyter password - Never use default token
- Firewall GPU instances - Restrict SSH to your IP
- Use private networks - For multi-instance training
Performance
- Pin memory - In PyTorch DataLoaders
- Use mixed precision - FP16 for 2x speedup
- Optimize batch size - Maximize GPU utilization
- Profile your code - Find bottlenecks with PyTorch Profiler / TF Profiler
- Multi-GPU training - Use DDP for scaling
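Batch-size tuning interacts with gradient accumulation: the effective batch size is the per-device batch times the accumulation steps times the GPU count. A minimal sketch of the arithmetic (both helpers are illustrative; the numbers match the BERT example above):

```python
def effective_batch(per_device, accumulation_steps, num_gpus):
    """Examples contributing to each optimizer step."""
    return per_device * accumulation_steps * num_gpus

def steps_per_epoch(dataset_size, per_device, accumulation_steps, num_gpus):
    """Optimizer steps per epoch (drops the final partial step)."""
    return dataset_size // effective_batch(per_device, accumulation_steps, num_gpus)

# e.g. 25,000 IMDB examples, batch 16, 4 accumulation steps, 1 GPU
print(effective_batch(16, 4, 1))          # 64
print(steps_per_epoch(25_000, 16, 4, 1))  # 390
```

Keeping the effective batch size constant while trading per-device batch for accumulation steps is the usual way to fit a large model into limited GPU memory.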
Troubleshooting
CUDA out of memory
# Reduce batch size
batch_size = 16 # was 32
# Enable gradient accumulation
accumulation_steps = 4
# Clear cache
torch.cuda.empty_cache()
# Enable gradient checkpointing (slower but uses less memory)
model.gradient_checkpointing_enable()

Jupyter not accessible
# Check if running
jupyter lab list
# Check firewall
sudo ufw status
# Restart Jupyter
jupyter lab stop
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root

Slow training
# Check GPU utilization
nvidia-smi -l 1
# Should be 90-100%
# If low, check data loading
# Profile with:
python -m torch.utils.bottleneck train.py

Support
ML environment issues: support@fugoku.com
Community: Discord #ml-ai channel
Documentation: docs.fugoku.com/ml-environments
Next Steps:
- Deploy Models for inference
- Learn about GPU Compute for specifications
- Explore the CLI for automation
- Read about Storage for datasets