Architectural decisions with rationale, so future sessions don't re-litigate them.
Decision: Custom REST API + S3. Do not use iCloud Drive, CloudKit, or any platform cloud as the sync backend.
iCloud Drive was evaluated and rejected for product use. The core issue: you lose control of the most user-visible parts of the experience.
| What you lose | Impact |
|---|---|
| Sync timing | iCloud decides when to push. A note saved on iPhone may take 30s–5min to appear on Mac. Cannot be influenced. |
| Ordering | iCloud doesn't know that two files are related. Linked notes can arrive out of order. |
| Error states | "iCloud Sync Paused" is Apple's error, not ours. Cannot intercept, explain, or recover gracefully. |
| Conflict timing | Conflicts surface after both devices have diverged, never proactively. |
| Telemetry | Cannot see why sync is slow for a specific user. Debugging is blind. |
| Cross-platform | iCloud Drive has no Android story. Any future Android support would require a full re-architecture. |
iCloud Drive is acceptable as an optional data portability layer (letting users mirror their vault in the Files app), but it must never be the canonical sync backend.
The data is small text files: a heavy user with 1,000 notes stores ~50 MB in S3, which is effectively free (~$0.001/month in storage). The API is a thin coordination layer; it never touches file bytes.
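A back-of-envelope check on that storage figure. The note count and average note size are assumptions, and the S3 Standard rate (~$0.023/GB-month) is the currently published price; verify against the AWS pricing page:

```rust
fn main() {
    // Hypothetical heavy user: 1,000 notes at ~50 KB of text each (assumed sizes).
    let notes: f64 = 1_000.0;
    let avg_note_kb: f64 = 50.0;
    let total_gb = notes * avg_note_kb / 1_000_000.0; // ~0.05 GB (~50 MB)

    // Assumed S3 Standard rate: ~$0.023 per GB-month.
    let monthly_usd = total_gb * 0.023; // on the order of $0.001/month

    println!("{total_gb:.3} GB -> ${monthly_usd:.4}/month");
}
```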
What we gain:
| Control point | iCloud Drive | Custom S3 |
|---|---|---|
| Push "sync now" to other devices | No | Yes (APNs/FCM on confirm) |
| Near real-time sync | No | Yes (push-triggered, seconds) |
| Sync telemetry per user | No | Yes |
| Graceful degradation UX | No | Yes (we control the error states) |
| Cross-platform (Android, web) | No | Yes (same protocol) |
| E2E encryption, user controls keys | Partial | Yes (API/S3 never read file contents) |
| Data export (download as zip) | Files app hack | First-class GET /vault/export |
| Conflict detection | After-the-fact | Version-checked on every push |
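The last row is the heart of the protocol. A minimal sketch of the version check, with hypothetical type and function names (in the real handler, `server_version` would come from Postgres before any presigned PUT is issued):

```rust
/// Outcome of a client pushing a file edit (hypothetical types, not the real API).
#[derive(Debug, PartialEq)]
enum PushOutcome {
    /// Base version matched; the server advances to the new version.
    Accepted { new_version: u64 },
    /// Another device pushed in between; the client must pull and merge first.
    Conflict { server_version: u64 },
}

/// Compare-and-set on the file's version number, checked on every push.
fn check_push(server_version: u64, client_base_version: u64) -> PushOutcome {
    if client_base_version == server_version {
        PushOutcome::Accepted { new_version: server_version + 1 }
    } else {
        PushOutcome::Conflict { server_version }
    }
}

fn main() {
    // Device A edits on top of version 3: accepted, server moves to 4.
    assert_eq!(check_push(3, 3), PushOutcome::Accepted { new_version: 4 });
    // Device B also based its edit on version 3, but the server is now at 4:
    // the conflict surfaces at push time, not after both devices have diverged.
    assert_eq!(check_push(4, 3), PushOutcome::Conflict { server_version: 4 });
}
```

In Postgres this collapses to a single `UPDATE ... WHERE version = $base`; an affected-row count of zero is the conflict case.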
Decision: Rust with Axum. Not Go, not Kotlin/Ktor, not TypeScript.
This API validates JWTs, queries Postgres, generates presigned URLs, and dispatches push notifications. It never reads or writes vault file bytes. The bottleneck is always network/DB round-trips.
| Factor | Rust (Axum) | Go | Kotlin (Ktor) |
|---|---|---|---|
| Memory per instance | ~10–30 MB | ~30–50 MB | ~200–400 MB (JVM) |
| Cold start (Lambda) | ~50 ms | ~80 ms | ~2–5 s |
| Tail latency (p99) | Best | Very good | Good |
| Binary size | ~5 MB single binary | ~10 MB single binary | ~50 MB+ fat JAR |
| Monthly hosting cost at scale | Lowest | Low | 3–5× more (RAM) |
Kotlin/Ktor was specifically rejected despite the KMP model-sharing benefit. The JVM memory footprint means paying 3–5× more per instance. For a thin coordination API, sharing ~100 lines of data models is not worth the operational cost.
Go was the pragmatic alternative. If Rust proves too slow to iterate on, Go is the fallback: similar operational profile, faster to write.
The API surface is small: ~5 core endpoints, auth middleware, S3 presigned URL generation, push notification dispatch. This is not a complex domain – maybe 1,500–2,000 lines of actual logic. The Axum + sqlx + aws-sdk-rust stack is mature and well-documented for exactly this pattern.
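To make the "~5 core endpoints" claim concrete, here is one plausible shape for that surface. These route names are hypothetical illustrations; only `GET /vault/export` is actually named in this document:

```rust
/// Hypothetical core endpoints; only GET /vault/export is fixed by this doc.
fn routes() -> [(&'static str, &'static str, &'static str); 5] {
    [
        ("POST", "/auth/login", "exchange credentials for a JWT"),
        ("GET", "/vault/changes", "list file versions changed since a cursor"),
        ("POST", "/vault/push", "version check + presigned S3 PUT URL"),
        ("GET", "/vault/pull", "presigned S3 GET URLs for changed files"),
        ("GET", "/vault/export", "download the whole vault as a zip"),
    ]
}

fn main() {
    for (method, path, what) in routes() {
        println!("{method:4} {path:15} {what}");
    }
}
```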
The operational savings (smaller instances, faster cold starts on Lambda, lower tail latency) compound over the lifetime of the product.
Stack: Rust 2024 edition, Axum 0.8, sqlx (Postgres, compile-time checked queries), aws-sdk-s3, jsonwebtoken, tower middleware.
Decision: AWS. Not Azure, not GCP.
The architecture is designed around S3. S3 presigned URLs are native; using Azure Blob Storage would require rewriting the presigned URL logic for no benefit.
AWS services used:
| Service | Purpose |
|---|---|
| **Lambda** | API hosting (pay-per-request, $0 at idle) |
| **API Gateway** | HTTPS endpoint in front of Lambda |
| **RDS Postgres** | User, device, and file version state |
| **S3** | Vault file storage (steel-notes-vaults/ bucket) |
| **SNS → APNs/FCM** | Push notification fan-out to devices |
Infrastructure is defined in Terraform at server/deploy/terraform/.
Decision: RDS t4g.nano to start.
t4g.nano = 0.5 GB RAM, 2 vCPU burstable, ARM64 (Graviton2). Lambda + API Gateway handle all compute, so there are no EC2 or Fargate instances to size at launch.
Decision: Lambda as the primary deployment target. ECS Fargate as a future fallback if Lambda p99 latency becomes a problem at scale.
Rust's binary (~5 MB) and cold start (~50 ms) make Lambda viable in a way that JVM-based languages are not. cargo-lambda handles the build and deploy pipeline. The binary runs on ECS Fargate with zero code changes if needed.
Decision: Single repo for now. Backend lives in server/ alongside shared/, cli/, iosApp/.
Rationale: moving fast matters more than clean repo boundaries at this stage. Splitting into separate repos adds CI/CD complexity, versioning overhead, and friction for cross-cutting changes (e.g., a sync protocol change touches both shared/ and server/).
Revisit when: the team grows and backend/client development cadences diverge significantly.
```
steel-notes/
├── shared/              # KMP shared Kotlin module
├── cli/                 # Kotlin/Native macOS CLI
├── iosApp/              # SwiftUI iOS app
├── server/              # Rust/Axum backend (AWS Lambda)
│   ├── src/
│   ├── Cargo.toml
│   ├── Dockerfile
│   └── deploy/terraform/
└── docs/                # Architecture docs
```
1. Pricing model – free tier limits + paid subscription? Storage cost per user is negligible (~$0.001/month for text). Real costs are Lambda invocations and push notifications. Decide before Phase 3E.
2. Cloudflare R2 vs AWS S3 – R2 has no egress fees (S3 charges $0.09/GB out). For a text-heavy app with small files, egress cost is negligible now but worth revisiting if attachments (images, PDFs) are added. S3 chosen for ecosystem simplicity; R2 is a drop-in swap if egress becomes meaningful.
3. E2E encryption – architecture supports it (API and S3 never read file contents), but key management UX is complex. Deferred post-launch or as a premium feature.
4. Lambda → Fargate migration trigger – if p99 API latency from Lambda cold starts exceeds 500 ms at scale, migrate to Fargate minimum-1-task. No code changes required.
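The cost claims in items 1–2 can be sanity-checked with a back-of-envelope sketch. Every number below is an assumption, not a measurement: the usage profile is invented, and the prices (Lambda ~$0.20 per 1M requests, S3 Standard ~$0.023/GB-month, egress ~$0.09/GB) should be verified against the current AWS pricing pages:

```rust
fn main() {
    // Hypothetical usage profile for one active user (assumed, not measured).
    let requests_per_month: f64 = 100.0 * 30.0; // ~100 sync calls/day
    let vault_gb: f64 = 0.05;                   // ~50 MB of text
    let egress_gb: f64 = 0.5;                   // assumed monthly pull volume

    // Assumed list prices; verify against the current AWS pricing pages.
    let lambda = requests_per_month * (0.20 / 1_000_000.0); // $0.20 per 1M requests
    let storage = vault_gb * 0.023;                         // S3 Standard, $/GB-month
    let egress = egress_gb * 0.09;                          // S3 -> internet, $/GB

    // All three are sub-cent, but egress is the largest line item even for a
    // text-only vault: this is the pressure that would make R2 interesting.
    println!(
        "lambda ${lambda:.4} + storage ${storage:.4} + egress ${egress:.4} = ${:.4}",
        lambda + storage + egress
    );
}
```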