From c48ae896a45e198af30d6f7d5b69de1638786342 Mon Sep 17 00:00:00 2001
From: Chris Rai
Date: Wed, 4 Feb 2026 21:18:24 -0500
Subject: [PATCH] cost service: add platform base + per-VIN resource model with CPU/RAM display
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Updated cost model to show: (Platform Base) + (Per-VIN × VINs)
- Platform base: 176 cores / 896GB RAM (Kafka, ClickHouse, MongoDB, Redis, PostgreSQL, gateway, monitoring)
- Per-VIN marginal: 50mc / 82MB per vehicle
- Added RESOURCE USAGE MODEL and COST FORMULA sections to report
- Added CPU (mc) and RAM (MB) columns to TOP COST VEHICLES table
- Updated README with new report output
- virtual-vehicle: documented Vault cert TTL error troubleshooting
---
 deploy/cec-prd-cluster/cost.yaml    |   4 +-
 services/cost/README.md             | 126 ++++++++++++++++------------
 services/cost/handlers/handlers.go  |  26 +++++-
 services/cost/services/collector.go |  76 +++++++++++------
 services/virtual-vehicle/README.md  |  24 ++++++
 5 files changed, 172 insertions(+), 84 deletions(-)

diff --git a/deploy/cec-prd-cluster/cost.yaml b/deploy/cec-prd-cluster/cost.yaml
index 13c6c9d..af93b00 100644
--- a/deploy/cec-prd-cluster/cost.yaml
+++ b/deploy/cec-prd-cluster/cost.yaml
@@ -17,8 +17,8 @@ spec:
     spec:
       containers:
         - name: cost
-          image: fiskercloud.azurecr.io/cost:v7
-          imagePullPolicy: Always
+          image: fiskercloud.azurecr.io/cost:v11
+          imagePullPolicy: IfNotPresent
           ports:
             - containerPort: 8077
               name: http
diff --git a/services/cost/README.md b/services/cost/README.md
index 4de3ca7..b2a6638 100644
--- a/services/cost/README.md
+++ b/services/cost/README.md
@@ -12,60 +12,68 @@ This service estimates the cost of running cloud services per VIN by:

 ## Cost Estimation Methodology

-### Resource Estimates Per Active VIN
+### Cost Model

-| Activity Level | Messages/15min | CPU (cores) | Memory (GB) |
-|---------------|----------------|-------------|-------------|
-| Low | < 100 | 0.80 | 1.20 |
-| Medium | 100-1000 | 1.20 | 1.80 |
-| High | > 1000 | 1.60 | 2.40 |
+The cost model separates fixed platform costs from variable per-VIN costs:

-These estimates account for the full data pipeline per vehicle:
-- Data ingestion (MQTT/HTTP endpoints)
-- Kafka message processing
-- Stream processing and transformations
-- ClickHouse storage and queries
-- Redis caching
-- MongoDB document storage
-- API serving
+```
+Total Cost = Platform Base Cost + (Per-VIN Cost × Number of VINs) + Managed Services
+```

-### Base Infrastructure Costs
+Whether you have 100 vehicles or 100,000, you still need Kafka, databases, and gateway services running. That's your platform base cost. Then each additional vehicle adds a small marginal cost on top.

-Shared infrastructure costs are distributed across active vehicles each collection interval:
+### Platform Base Resources (Fixed)

-| Component | Cloud ($/15min) | On-Prem ($/15min) |
-|-----------|-----------------|-------------------|
-| Storage, Event Hubs, Defender, monitoring | $10.00 | $0.90 |
+What it takes to run the cloud services platform:

-### Cost Rates (per hour)
+| Component | CPU (cores) | Memory (GB) | Notes |
+|-----------|-------------|-------------|-------|
+| Kafka brokers | 32 | 128 | 3-node cluster |
+| ClickHouse | 64 | 256 | 3 shards for HA |
+| MongoDB | 16 | 128 | Replica set |
+| Redis | 16 | 128 | Cluster mode |
+| PostgreSQL | 32 | 128 | Primary + replicas |
+| Gateway services | 8 | 64 | API gateway, auth |
+| Monitoring/logging | 8 | 64 | Prometheus, Grafana, Loki |
+| **Total Platform Base** | **176** | **896** | |

-| Resource | Cloud (Azure) | On-Prem |
-|----------|---------------|---------|
-| CPU/core | $0.30 | $0.02 |
-| Memory/GB| $0.08 | $0.005 |
+### Per-VIN Resources (Marginal)

-#### Cloud Rates (Fudged Higher)
-- Based on Azure D-series VM pricing + 50% managed services overhead
-- Includes: AKS compute, managed Kafka (Event Hubs), CosmosDB, Azure Storage, networking, monitoring
-- Intentionally conservative (higher) to show true cloud TCO
+Incremental resources needed for each additional connected vehicle:

-#### On-Prem Rates (Fudged Lower)
-- Based on 3-year hardware amortization only
-- Assumes: owned hardware, minimal ops overhead
-- Intentionally optimistic (lower) to show on-prem savings
-- Does NOT include: datacenter costs, staff, power, cooling, network, maintenance
+| Activity Level | Messages/15min | CPU (millicores) | Memory (MB) |
+|---------------|----------------|------------------|-------------|
+| Low | < 100 | 50 | 82 |
+| Medium | 100-1000 | 75 | 123 |
+| High | > 1000 | 100 | 164 |
+
+### Cost Rates
+
+| Resource | Cloud (Azure) | On-Prem/Bare Metal |
+|----------|---------------|-------------------|
+| CPU/core-hour | $0.30 | $0.02 |
+| Memory/GB-hour | $0.08 | $0.005 |
+| Managed Services/15min | $10.00 | $2.50 |
+
+#### Why On-Prem is ~90% Cheaper
+- **Platform base**: Same resource footprint, but cloud compute rates are roughly 15x higher (CPU $0.30 vs $0.02 per core-hour, memory $0.08 vs $0.005 per GB-hour)
+- **Per-VIN compute**: The same ~15x rate gap applies to each vehicle's marginal CPU and memory
+- **Managed services**: Event Hubs, CosmosDB, etc. cost ~4x their self-hosted equivalents ($10.00 vs $2.50 per 15 min)

 ### Savings Calculation

 ```
-Per-VIN Cost = (CPU_cores × rate + Memory_GB × rate) × hours + (base_infra / active_vins)
-Cloud Cost = Per-VIN costs summed across fleet
-On-Prem Cost = Same formula with on-prem rates
+Platform Base Cost = (176 cores × rate + 896 GB × rate) × hours
+Per-VIN Cost = (0.05 cores × rate + 0.08 GB × rate) × hours × activity_multiplier
+Total Cost = Platform Base + (Per-VIN × VIN count) + Managed Services
+
+Cloud Cost = Total with cloud rates
+On-Prem Cost = Total with on-prem rates
 Savings = Cloud Cost - On-Prem Cost
 Savings % = (Savings / Cloud Cost) × 100
 ```

-Expected savings: **~85-88%** with on-prem hosting (hardware costs only).
+Expected savings: **~90%** with on-prem/bare metal hosting.

 ### Projected Annual Costs (5000 vehicles)

@@ -114,40 +122,50 @@ curl http://localhost:8077/cost/report
 ╔══════════════════════════════════════════════════════════════════╗
 ║ COST SERVICE REPORT ║
 ╠══════════════════════════════════════════════════════════════════╣
-║ Period: 2026-01-03 to 2026-02-03
+║ Period: 2026-01-05 to 2026-02-05
 ╠══════════════════════════════════════════════════════════════════╣
 ║ FLEET OVERVIEW ║
 ║ ─────────────────────────────────────────────────────────────── ║
-║ Active Vehicles: 3143
-║ Cloud Cost: $5658.12
-║ On-Prem Cost: $389.98
-║ Savings: $5268.15 (93.1%)
+║ Active Vehicles: 3229
+║ Cloud Cost: $9761.61
+║ On-Prem Cost: $677.88
+║ Savings: $9083.73 (93.1%)
+╠══════════════════════════════════════════════════════════════════╣
+║ RESOURCE USAGE MODEL ║
+║ ─────────────────────────────────────────────────────────────── ║
+║ Platform Base: 176 cores / 896 GB RAM (fixed)
+║ Per-VIN Marginal: 50 millicores / 82 MB RAM
+║ Total Fleet: 337.5 cores / 1154.3 GB RAM
+╠══════════════════════════════════════════════════════════════════╣
+║ COST FORMULA ║
+║ ─────────────────────────────────────────────────────────────── ║
+║ (Platform Base) + (Per-VIN × 3229 VINs) + Managed Services
 ╠══════════════════════════════════════════════════════════════════╣
 ║ COST RATES ║
 ║ ─────────────────────────────────────────────────────────────── ║
 ║ Cloud: CPU $0.30/core-hr Memory $0.080/GB-hr
 ║ On-Prem: CPU $0.02/core-hr Memory $0.005/GB-hr
-║ Base Infra: Cloud $10.00/15min On-Prem $0.90/15min
+║ Base Infra: Cloud $10.00/15min On-Prem $2.50/15min
 ╠══════════════════════════════════════════════════════════════════╣
 ║ ANNUAL PROJECTION (based on current usage) ║
 ║ ─────────────────────────────────────────────────────────────── ║
-║ Cloud Annual: $67897.44
-║ On-Prem Annual: $4679.70
-║ Annual Savings: $63217.74
+║ Cloud Annual: $117139.28
+║ On-Prem Annual: $8134.50
+║ Annual Savings: $109004.77
 ╚══════════════════════════════════════════════════════════════════╝

 TOP COST VEHICLES:
-VIN Cloud $ On-Prem $ Savings %
-─────────────────── ────────── ────────── ────────
-VCF1EBU27PG008191 6.84 0.48 93.0%
-VCF1ZBU28PG005207 6.84 0.48 93.0%
-VCF1ZBU29PG005488 6.84 0.48 93.0%
-VCF1ZBU26PG004962 6.76 0.48 93.0%
-VCF1EBU20PG008145 6.75 0.47 93.0%
+VIN CPU (mc) RAM (MB) Cloud $ On-Prem $ Savings %
+─────────────────── ──────── ──────── ────────── ────────── ────────
+VCF1UBU21PG008884 100 164 14.10 1.04 92.6%
+VCF1EBU24PG007242 100 164 14.10 1.04 92.6%
+VCF1ZBU29PG006267 100 164 14.10 1.04 92.6%
+VCF1EBU26PG007307 75 123 13.41 0.99 92.6%
+VCF1EBU22PG011967 50 82 12.77 0.93 92.8%
 ...
 ```

-*Note: Report generated 2026-02-03. Costs accumulate over time as the collector runs every 15 minutes.*
+*Note: Report generated 2026-02-05. Costs accumulate over time as the collector runs every 15 minutes.*

 ## Configuration

diff --git a/services/cost/handlers/handlers.go b/services/cost/handlers/handlers.go
index 0f5f304..27132e9 100644
--- a/services/cost/handlers/handlers.go
+++ b/services/cost/handlers/handlers.go
@@ -220,6 +220,16 @@ func GetReport(w http.ResponseWriter, r *http.Request) {
 ║ On-Prem Cost: $%.2f
 ║ Savings: $%.2f (%.1f%%)
 ╠══════════════════════════════════════════════════════════════════╣
+║ RESOURCE USAGE MODEL ║
+║ ─────────────────────────────────────────────────────────────── ║
+║ Platform Base: %.0f cores / %.0f GB RAM (fixed)
+║ Per-VIN Marginal: %.0f millicores / %.0f MB RAM
+║ Total Fleet: %.1f cores / %.1f GB RAM
+╠══════════════════════════════════════════════════════════════════╣
+║ COST FORMULA ║
+║ ─────────────────────────────────────────────────────────────── ║
+║ (Platform Base) + (Per-VIN × %d VINs) + Managed Services
+╠══════════════════════════════════════════════════════════════════╣
 ║ COST RATES ║
 ║ ─────────────────────────────────────────────────────────────── ║
 ║ Cloud: CPU $%.2f/core-hr Memory $%.3f/GB-hr
@@ -237,12 +247,20 @@ func GetReport(w http.ResponseWriter, r *http.Request) {
 	annualOnprem := summary.TotalOnpremCost * 12
 	annualSavings := annualCloud - annualOnprem

+	// Calculate total fleet resources
+	totalCPU := services.PlatformBaseCPUCores + (services.PerVinCPUCores * float64(summary.VehicleCount))
+	totalRAM := services.PlatformBaseMemoryGB + (services.PerVinMemoryGB * float64(summary.VehicleCount))
+
 	fmt.Fprintf(w, report,
 		from.Format("2006-01-02"), to.Format("2006-01-02"),
 		summary.VehicleCount,
 		summary.TotalCloudCost,
 		summary.TotalOnpremCost,
 		summary.TotalSavings, summary.SavingsPercent,
+		services.PlatformBaseCPUCores, services.PlatformBaseMemoryGB,
+		services.PerVinCPUCores*1000, services.PerVinMemoryGB*1024,
+		totalCPU, totalRAM,
+		summary.VehicleCount,
 		services.CloudCPUPerCoreHour, services.CloudMemoryPerGBHour,
 		services.OnpremCPUPerCoreHour, services.OnpremMemoryPerGBHour,
 		services.BaseInfraCloudCost, services.BaseInfraOnpremCost,
@@ -252,11 +270,11 @@ func GetReport(w http.ResponseWriter, r *http.Request) {
 	// Add top cost VINs if any
 	if len(summary.TopCostVins) > 0 {
 		fmt.Fprintf(w, "\nTOP COST VEHICLES:\n")
-		fmt.Fprintf(w, "%-20s %12s %12s %10s\n", "VIN", "Cloud $", "On-Prem $", "Savings %")
-		fmt.Fprintf(w, "%-20s %12s %12s %10s\n", "───────────────────", "──────────", "──────────", "────────")
+		fmt.Fprintf(w, "%-20s %10s %10s %12s %12s %10s\n", "VIN", "CPU (mc)", "RAM (MB)", "Cloud $", "On-Prem $", "Savings %")
+		fmt.Fprintf(w, "%-20s %10s %10s %12s %12s %10s\n", "───────────────────", "────────", "────────", "──────────", "──────────", "────────")
 		for _, v := range summary.TopCostVins {
-			fmt.Fprintf(w, "%-20s %12.2f %12.2f %9.1f%%\n",
-				truncateVIN(v.VIN), v.TotalCloudCost, v.TotalOnpremCost, v.SavingsPercent)
+			fmt.Fprintf(w, "%-20s %10.0f %10.0f %12.2f %12.2f %9.1f%%\n",
+				truncateVIN(v.VIN), v.AvgCPUCores*1000, v.AvgMemoryGB*1024, v.TotalCloudCost, v.TotalOnpremCost, v.SavingsPercent)
 		}
 	}
 }
diff --git a/services/cost/services/collector.go b/services/cost/services/collector.go
index b8d96e9..9ce8139 100644
--- a/services/cost/services/collector.go
+++ b/services/cost/services/collector.go
@@ -13,23 +13,29 @@ const (
 	CloudCPUPerCoreHour  = 0.30 // $/core/hour (Azure D-series + managed services + support)
 	CloudMemoryPerGBHour = 0.08 // $/GB/hour (includes managed DB, Redis, caching layers)

-	// On-prem costs (amortized hardware only)
+	// On-prem/bare metal costs (amortized hardware only)
 	// Assumes: 3-year hardware amortization, minimal ops overhead
 	// Does NOT include: datacenter, power, cooling, staff, network
 	OnpremCPUPerCoreHour  = 0.02  // $/core/hour
 	OnpremMemoryPerGBHour = 0.005 // $/GB/hour

-	// Estimated resource usage per active VIN
-	// Connected vehicle telemetry pipeline: ingestion → Kafka → processing → storage → APIs
-	// Each active VIN requires dedicated processing capacity across the stack
-	EstimatedCPUPerVin    = 0.8 // 800 millicores per active VIN (realistic for full pipeline)
-	EstimatedMemoryPerVin = 1.2 // 1.2GB per active VIN (buffers, state, caches, connections)
+	// Per-VIN resource usage (marginal cost per vehicle)
+	// This is the incremental CPU/RAM needed for each additional connected vehicle
+	// Covers: telemetry ingestion, Kafka processing, storage writes, API queries
+	PerVinCPUCores = 0.05 // 50 millicores per VIN (marginal)
+	PerVinMemoryGB = 0.08 // 0.08 GB (~82 MB) per VIN (marginal)

-	// Base infrastructure cost per collection interval (shared services)
-	// Storage, Event Hubs, Defender, other fixed costs
-	// ~$45k/month fixed = ~$10/15min
-	BaseInfraCloudCost  = 10.00 // $/15min for shared infra (cloud)
-	BaseInfraOnpremCost = 0.90  // $/15min for shared infra (on-prem)
+	// Platform base resources (fixed cost regardless of VIN count)
+	// This is the minimum infrastructure to run the cloud services platform:
+	// Kafka brokers, ClickHouse, MongoDB, Redis, PostgreSQL, gateway services, etc.
+	PlatformBaseCPUCores = 176.0 // 176 cores for base platform
+	PlatformBaseMemoryGB = 896.0 // 896GB RAM for base platform
+
+	// Base infrastructure cost per collection interval (managed services, storage, networking)
+	// Cloud: Event Hubs, CosmosDB, Azure Storage, Defender, monitoring (~$45k/month)
+	// On-prem: ~75% cheaper - just hardware amortization for equivalent capacity
+	BaseInfraCloudCost  = 10.00 // $/15min for managed services (cloud)
+	BaseInfraOnpremCost = 2.50  // $/15min for equivalent on-prem (75% cheaper)
 )

 // CalculateCosts computes cloud and on-prem costs for given resource usage
@@ -43,6 +49,26 @@ func CalculateCosts(cpuCores, memoryGB float64, durationHours float64) (cloudCos
 	return
 }

+// CalculatePlatformBaseCosts returns the fixed platform infrastructure costs
+// This is the cost to run the platform regardless of VIN count
+func CalculatePlatformBaseCosts(durationHours float64) (cloudCost, onpremCost float64) {
+	cloudCost = (PlatformBaseCPUCores * CloudCPUPerCoreHour * durationHours) +
+		(PlatformBaseMemoryGB * CloudMemoryPerGBHour * durationHours)
+
+	onpremCost = (PlatformBaseCPUCores * OnpremCPUPerCoreHour * durationHours) +
+		(PlatformBaseMemoryGB * OnpremMemoryPerGBHour * durationHours)
+
+	return
+}
+
+// CalculatePerVinCosts returns the marginal cost for a single VIN
+func CalculatePerVinCosts(durationHours float64, activityMultiplier float64) (cpuCores, memoryGB, cloudCost, onpremCost float64) {
+	cpuCores = PerVinCPUCores * activityMultiplier
+	memoryGB = PerVinMemoryGB * activityMultiplier
+	cloudCost, onpremCost = CalculateCosts(cpuCores, memoryGB, durationHours)
+	return
+}
+
 // CalculateBaseCosts returns the shared infrastructure costs for a collection interval
 func CalculateBaseCosts() (cloudCost, onpremCost float64) {
 	return BaseInfraCloudCost, BaseInfraOnpremCost
@@ -79,12 +105,15 @@ func collectMetrics() {
 		return
 	}

-	// Convert to cost records
-	// Estimate resource usage based on message activity
 	durationHours := 0.25 // 15 minutes = 0.25 hours
 	records := make([]CostRecord, 0, len(activeVins))

-	// Get base infrastructure costs and distribute across active VINs
+	// Calculate platform base costs (distributed across all VINs for reporting)
+	platformCloudCost, platformOnpremCost := CalculatePlatformBaseCosts(durationHours)
+	platformCloudPerVin := platformCloudCost / float64(len(activeVins))
+	platformOnpremPerVin := platformOnpremCost / float64(len(activeVins))
+
+	// Get managed services base costs (also distributed)
 	baseCloud, baseOnprem := CalculateBaseCosts()
 	baseCloudPerVin := baseCloud / float64(len(activeVins))
 	baseOnpremPerVin := baseOnprem / float64(len(activeVins))
@@ -98,22 +127,20 @@ func collectMetrics() {
 			activityMultiplier = 1.5
 		}

-		cpuCores := EstimatedCPUPerVin * activityMultiplier
-		memoryGB := EstimatedMemoryPerVin * activityMultiplier
+		// Calculate per-VIN marginal costs
+		cpuCores, memoryGB, vinCloudCost, vinOnpremCost := CalculatePerVinCosts(durationHours, activityMultiplier)

-		cloudCost, onpremCost := CalculateCosts(cpuCores, memoryGB, durationHours)
-
-		// Add share of base infrastructure costs
-		cloudCost += baseCloudPerVin
-		onpremCost += baseOnpremPerVin
+		// Total cost = per-VIN marginal + share of platform base + share of managed services
+		totalCloudCost := vinCloudCost + platformCloudPerVin + baseCloudPerVin
+		totalOnpremCost := vinOnpremCost + platformOnpremPerVin + baseOnpremPerVin

 		records = append(records, CostRecord{
 			VIN:           v.VIN,
 			Timestamp:     to,
 			CPUCores:      cpuCores,
 			MemoryGB:      memoryGB,
-			CloudCostUSD:  cloudCost,
-			OnpremCostUSD: onpremCost,
+			CloudCostUSD:  totalCloudCost,
+			OnpremCostUSD: totalOnpremCost,
 			PodCount:      1,
 		})
 	}
@@ -124,5 +151,6 @@ func collectMetrics() {
 		return
 	}

-	logger.Info().Msgf("Collected and stored cost metrics for %d VINs", len(records))
+	logger.Info().Msgf("Collected and stored cost metrics for %d VINs (platform base: cloud $%.2f, on-prem $%.2f)",
+		len(records), platformCloudCost, platformOnpremCost)
 }
diff --git a/services/virtual-vehicle/README.md b/services/virtual-vehicle/README.md
index ad0ca4a..5daf302 100644
--- a/services/virtual-vehicle/README.md
+++ b/services/virtual-vehicle/README.md
@@ -45,3 +45,27 @@ This service simulates a connected vehicle (T.Rex) by:
 | 0x200 | Door status |
 | 0x201 | Light status |
 | 0x300 | HVAC |
+
+
+## Troubleshooting
+
+### 503 Error - Vault Certificate TTL Issue
+
+If the service returns a 503 and logs show a Vault error like:
+
+```json
+{"level":"error","message":"manufacturer API returned 503: {\"message\":\"Vault unable to create certificate, status code: 400 {\\\"errors\\\":[\\\"cannot satisfy request, as TTL would result in notAfter 2034-02-01T02:34:43.379661987Z that is beyond the expiration of the CA certificate at 2032-04-10T20:21:59Z\\\"]}\\n\",\"error\":\"Service Unavailable\"}"}
+```
+
+This happens when the certificate TTL requested (8 years) would extend beyond the CA certificate's expiration date. The manufacturer service requests a cert that expires in 2034, but the Vault CA expires in 2032.
+
+**Fix options:**
+
+1. **Reduce the PKI role TTL in Vault** *(non-disruptive - only affects new certs)*:
+   ```bash
+   vault write pki/roles/vehicle-cert ttl=17520h max_ttl=17520h # 2 years
+   ```
+2. **Extend the CA certificate** - Regenerate the Vault PKI CA with a longer validity period *(disruptive - all existing certs need reissue)*
+3. **Rotate the CA** - Create a new CA with extended validity *(disruptive - requires cert reissue)*
+
+On dev-cluster, the CA was created with a ~10-year validity (expiring 2032-04-10), so an 8-year cert TTL requested in 2026 pushes past it.