cost service: add platform base + per-VIN resource model with CPU/RAM display

- Updated cost model to show: (Platform Base) + (Per-VIN × VINs)
- Platform base: 176 cores / 896GB RAM (Kafka, ClickHouse, MongoDB, Redis, PostgreSQL, gateway, monitoring)
- Per-VIN marginal: 50mc / 82MB per vehicle
- Added RESOURCE USAGE MODEL and COST FORMULA sections to report
- Added CPU (mc) and RAM (MB) columns to TOP COST VEHICLES table
- Updated README with new report output
- virtual-vehicle: documented Vault cert TTL error troubleshooting
2026-02-04 21:18:24 -05:00
parent 54b17cf126
commit c48ae896a4
5 changed files with 172 additions and 84 deletions
--- a/deploy/cec-prd-cluster/cost.yaml
+++ b/deploy/cec-prd-cluster/cost.yaml
@@ -17,8 +17,8 @@ spec:
    spec:
      containers:
        - name: cost
-          image: fiskercloud.azurecr.io/cost:v7
-          imagePullPolicy: Always
+          image: fiskercloud.azurecr.io/cost:v11
+          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8077
              name: http
--- a/services/cost/README.md
+++ b/services/cost/README.md
@@ -12,60 +12,68 @@ This service estimates the cost of running cloud services per VIN by:

 ## Cost Estimation Methodology

-### Resource Estimates Per Active VIN
+### Cost Model

-| Activity Level | Messages/15min | CPU (cores) | Memory (GB) |
-|---------------|----------------|-------------|-------------|
-| Low           | < 100          | 0.80        | 1.20        |
-| Medium        | 100-1000       | 1.20        | 1.80        |
-| High          | > 1000         | 1.60        | 2.40        |
+The cost model separates fixed platform costs from variable per-VIN costs:

-These estimates account for the full data pipeline per vehicle:
- Data ingestion (MQTT/HTTP endpoints)
- Kafka message processing
- Stream processing and transformations
- ClickHouse storage and queries
- Redis caching
- MongoDB document storage
- API serving
+```
+Total Cost = Platform Base Cost + (Per-VIN Cost × Number of VINs) + Managed Services
+```

-### Base Infrastructure Costs
+Whether you have 100 vehicles or 100,000, you still need Kafka, databases, and gateway services running. That's your platform base cost. Then each additional vehicle adds a small marginal cost on top.

-Shared infrastructure costs are distributed across active vehicles each collection interval:
+### Platform Base Resources (Fixed)

-| Component | Cloud ($/15min) | On-Prem ($/15min) |
-|-----------|-----------------|-------------------|
-| Storage, Event Hubs, Defender, monitoring | $10.00 | $0.90 |
+What it takes to run the cloud services platform:

-### Cost Rates (per hour)
+| Component | CPU (cores) | Memory (GB) | Notes |
+|-----------|-------------|-------------|-------|
+| Kafka brokers | 32 | 128 | 3-node cluster |
+| ClickHouse | 64 | 256 | 3 shards for HA |
+| MongoDB | 16 | 128 | Replica set |
+| Redis | 16 | 128 | Cluster mode |
+| PostgreSQL | 32 | 128 | Primary + replicas |
+| Gateway services | 8 | 64 | API gateway, auth |
+| Monitoring/logging | 8 | 64 | Prometheus, Grafana, Loki |
+| **Total Platform Base** | **176** | **896** | |

-| Resource | Cloud (Azure) | On-Prem |
-|----------|---------------|---------|
-| CPU/core | $0.30         | $0.02   |
-| Memory/GB| $0.08         | $0.005  |
+### Per-VIN Resources (Marginal)

-#### Cloud Rates (Fudged Higher)
- Based on Azure D-series VM pricing + 50% managed services overhead
- Includes: AKS compute, managed Kafka (Event Hubs), CosmosDB, Azure Storage, networking, monitoring
- Intentionally conservative (higher) to show true cloud TCO
+Incremental resources needed for each additional connected vehicle:

-#### On-Prem Rates (Fudged Lower)
- Based on 3-year hardware amortization only
- Assumes: owned hardware, minimal ops overhead
- Intentionally optimistic (lower) to show on-prem savings
- Does NOT include: datacenter costs, staff, power, cooling, network, maintenance
+| Activity Level | Messages/15min | CPU (millicores) | Memory (MB) |
+|---------------|----------------|------------------|-------------|
+| Low           | < 100          | 50               | 82          |
+| Medium        | 100-1000       | 75               | 123         |
+| High          | > 1000         | 100              | 164         |
+
+### Cost Rates
+
+| Resource | Cloud (Azure) | On-Prem/Bare Metal |
+|----------|---------------|-------------------|
+| CPU/core-hour | $0.30 | $0.02 |
+| Memory/GB-hour | $0.08 | $0.005 |
+| Managed Services/15min | $10.00 | $2.50 |
+
+#### Why On-Prem is ~90% Cheaper
+- **Platform base**: Same CPU/RAM footprint either way, but cloud rates are ~15x higher ($0.30 vs $0.02 per core-hour, $0.08 vs $0.005 per GB-hour)
+- **Per-VIN compute**: The same ~15x rate gap applies to each vehicle's marginal CPU and memory
+- **Managed services**: Event Hubs, CosmosDB, etc. cost about 4x their self-hosted equivalents ($10.00 vs $2.50 per 15 min)

 ### Savings Calculation

 ```
-Per-VIN Cost = (CPU_cores × rate + Memory_GB × rate) × hours + (base_infra / active_vins)
-Cloud Cost = Per-VIN costs summed across fleet
-On-Prem Cost = Same formula with on-prem rates
+Platform Base Cost = (176 cores × rate + 896 GB × rate) × hours
+Per-VIN Cost = (0.05 cores × rate + 0.08 GB × rate) × hours × activity_multiplier
+Total Cost = Platform Base + (Per-VIN × VIN count) + Managed Services
+
+Cloud Cost = Total with cloud rates
+On-Prem Cost = Total with on-prem rates  
 Savings = Cloud Cost - On-Prem Cost
 Savings % = (Savings / Cloud Cost) × 100
 ```

-Expected savings: **~85-88%** with on-prem hosting (hardware costs only).
+Expected savings: **~90%** with on-prem/bare metal hosting.

 ### Projected Annual Costs (5000 vehicles)

@@ -114,40 +122,50 @@ curl http://localhost:8077/cost/report
 ╔══════════════════════════════════════════════════════════════════╗
 ║                    COST SERVICE REPORT                           ║
 ╠══════════════════════════════════════════════════════════════════╣
-║  Period: 2026-01-03 to 2026-02-03
+║  Period: 2026-01-05 to 2026-02-05
 ╠══════════════════════════════════════════════════════════════════╣
 ║  FLEET OVERVIEW                                                  ║
 ║  ─────────────────────────────────────────────────────────────── ║
-║  Active Vehicles:     3143                                         
-║  Cloud Cost:          $5658.12                                      
-║  On-Prem Cost:        $389.98                                      
-║  Savings:             $5268.15 (93.1%)                             
+║  Active Vehicles:     3229                                         
+║  Cloud Cost:          $9761.61                                      
+║  On-Prem Cost:        $677.88                                      
+║  Savings:             $9083.73 (93.1%)                             
+╠══════════════════════════════════════════════════════════════════╣
+║  RESOURCE USAGE MODEL                                            ║
+║  ─────────────────────────────────────────────────────────────── ║
+║  Platform Base:       176 cores / 896 GB RAM (fixed)              
+║  Per-VIN Marginal:    50 millicores / 82 MB RAM                   
+║  Total Fleet:         337.5 cores / 1154.3 GB RAM                 
+╠══════════════════════════════════════════════════════════════════╣
+║  COST FORMULA                                                    ║
+║  ─────────────────────────────────────────────────────────────── ║
+║  (Platform Base) + (Per-VIN × 3229 VINs) + Managed Services       
 ╠══════════════════════════════════════════════════════════════════╣
 ║  COST RATES                                                      ║
 ║  ─────────────────────────────────────────────────────────────── ║
 ║  Cloud:   CPU $0.30/core-hr   Memory $0.080/GB-hr                 
 ║  On-Prem: CPU $0.02/core-hr   Memory $0.005/GB-hr                 
-║  Base Infra: Cloud $10.00/15min   On-Prem $0.90/15min             
+║  Base Infra: Cloud $10.00/15min   On-Prem $2.50/15min             
 ╠══════════════════════════════════════════════════════════════════╣
 ║  ANNUAL PROJECTION (based on current usage)                      ║
 ║  ─────────────────────────────────────────────────────────────── ║
-║  Cloud Annual:        $67897.44                                      
-║  On-Prem Annual:      $4679.70                                      
-║  Annual Savings:      $63217.74                                      
+║  Cloud Annual:        $117139.28                                      
+║  On-Prem Annual:      $8134.50                                      
+║  Annual Savings:      $109004.77                                      
 ╚══════════════════════════════════════════════════════════════════╝

 TOP COST VEHICLES:
-VIN                       Cloud $    On-Prem $  Savings %
-───────────────────    ──────────   ──────────   ────────
-VCF1EBU27PG008191            6.84         0.48      93.0%
-VCF1ZBU28PG005207            6.84         0.48      93.0%
-VCF1ZBU29PG005488            6.84         0.48      93.0%
-VCF1ZBU26PG004962            6.76         0.48      93.0%
-VCF1EBU20PG008145            6.75         0.47      93.0%
+VIN                    CPU (mc)   RAM (MB)      Cloud $    On-Prem $  Savings %
+───────────────────    ────────   ────────   ──────────   ──────────   ────────
+VCF1UBU21PG008884           100        164        14.10         1.04      92.6%
+VCF1EBU24PG007242           100        164        14.10         1.04      92.6%
+VCF1ZBU29PG006267           100        164        14.10         1.04      92.6%
+VCF1EBU26PG007307            75        123        13.41         0.99      92.6%
+VCF1EBU22PG011967            50         82        12.77         0.93      92.8%
 ...
 ```

-*Note: Report generated 2026-02-03. Costs accumulate over time as the collector runs every 15 minutes.*
+*Note: Report generated 2026-02-05. Costs accumulate over time as the collector runs every 15 minutes.*

 ## Configuration
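
Reviewer sketch for the per-VIN activity tiers in the README table above. The thresholds come from the table. The multipliers (1.0 / 1.5 / 2.0) are an assumption inferred from the 50 / 75 / 100 millicore rows, since only the 1.5 branch is visible in this diff, and `activityMultiplier` is a hypothetical helper name rather than a function from the service.

```go
package main

import "fmt"

// activityMultiplier maps a 15-minute message count to a resource multiplier.
// Thresholds follow the README table; the multiplier values are assumed.
func activityMultiplier(messagesPer15Min int) float64 {
	switch {
	case messagesPer15Min > 1000:
		return 2.0 // High
	case messagesPer15Min >= 100:
		return 1.5 // Medium
	default:
		return 1.0 // Low
	}
}

func main() {
	const perVinCores = 0.05 // marginal cores per VIN, from collector.go
	for _, msgs := range []int{42, 500, 5000} {
		m := activityMultiplier(msgs)
		fmt.Printf("%5d msgs/15min -> x%.1f -> %.0f millicores\n", msgs, m, perVinCores*m*1000)
	}
}
```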

--- a/services/cost/handlers/handlers.go
+++ b/services/cost/handlers/handlers.go
@@ -220,6 +220,16 @@ func GetReport(w http.ResponseWriter, r *http.Request) {
 ║  On-Prem Cost:        $%.2f                                      
 ║  Savings:             $%.2f (%.1f%%)                             
 ╠══════════════════════════════════════════════════════════════════╣
+║  RESOURCE USAGE MODEL                                            ║
+║  ─────────────────────────────────────────────────────────────── ║
+║  Platform Base:       %.0f cores / %.0f GB RAM (fixed)           
+║  Per-VIN Marginal:    %.0f millicores / %.0f MB RAM              
+║  Total Fleet:         %.1f cores / %.1f GB RAM                   
+╠══════════════════════════════════════════════════════════════════╣
+║  COST FORMULA                                                    ║
+║  ─────────────────────────────────────────────────────────────── ║
+║  (Platform Base) + (Per-VIN × %d VINs) + Managed Services        
+╠══════════════════════════════════════════════════════════════════╣
 ║  COST RATES                                                      ║
 ║  ─────────────────────────────────────────────────────────────── ║
 ║  Cloud:   CPU $%.2f/core-hr   Memory $%.3f/GB-hr                 
@@ -237,12 +247,20 @@ func GetReport(w http.ResponseWriter, r *http.Request) {
 	annualOnprem := summary.TotalOnpremCost * 12
 	annualSavings := annualCloud - annualOnprem

+	// Calculate total fleet resources
+	totalCPU := services.PlatformBaseCPUCores + (services.PerVinCPUCores * float64(summary.VehicleCount))
+	totalRAM := services.PlatformBaseMemoryGB + (services.PerVinMemoryGB * float64(summary.VehicleCount))
+
 	fmt.Fprintf(w, report,
 		from.Format("2006-01-02"), to.Format("2006-01-02"),
 		summary.VehicleCount,
 		summary.TotalCloudCost,
 		summary.TotalOnpremCost,
 		summary.TotalSavings, summary.SavingsPercent,
+		services.PlatformBaseCPUCores, services.PlatformBaseMemoryGB,
+		services.PerVinCPUCores*1000, services.PerVinMemoryGB*1024,
+		totalCPU, totalRAM,
+		summary.VehicleCount,
 		services.CloudCPUPerCoreHour, services.CloudMemoryPerGBHour,
 		services.OnpremCPUPerCoreHour, services.OnpremMemoryPerGBHour,
 		services.BaseInfraCloudCost, services.BaseInfraOnpremCost,
@@ -252,11 +270,11 @@ func GetReport(w http.ResponseWriter, r *http.Request) {
 	// Add top cost VINs if any
 	if len(summary.TopCostVins) > 0 {
 		fmt.Fprintf(w, "\nTOP COST VEHICLES:\n")
-		fmt.Fprintf(w, "%-20s %12s %12s %10s\n", "VIN", "Cloud $", "On-Prem $", "Savings %")
-		fmt.Fprintf(w, "%-20s %12s %12s %10s\n", "───────────────────", "──────────", "──────────", "────────")
+		fmt.Fprintf(w, "%-20s %10s %10s %12s %12s %10s\n", "VIN", "CPU (mc)", "RAM (MB)", "Cloud $", "On-Prem $", "Savings %")
+		fmt.Fprintf(w, "%-20s %10s %10s %12s %12s %10s\n", "───────────────────", "────────", "────────", "──────────", "──────────", "────────")
 		for _, v := range summary.TopCostVins {
-			fmt.Fprintf(w, "%-20s %12.2f %12.2f %9.1f%%\n",
-				truncateVIN(v.VIN), v.TotalCloudCost, v.TotalOnpremCost, v.SavingsPercent)
+			fmt.Fprintf(w, "%-20s %10.0f %10.0f %12.2f %12.2f %9.1f%%\n",
+				truncateVIN(v.VIN), v.AvgCPUCores*1000, v.AvgMemoryGB*1024, v.TotalCloudCost, v.TotalOnpremCost, v.SavingsPercent)
 		}
 	}
 }
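
Reviewer sketch of the unit conversion used for the new CPU (mc) and RAM (MB) columns. It mirrors the `*1000` and `*1024` factors in the handler above and shows why 0.08 GB is displayed as 82 MB rather than 80 MB. `formatResources` is a hypothetical helper for illustration only.

```go
package main

import "fmt"

// formatResources renders cores and GB the way the report columns do:
// millicores = cores × 1000, MB = GB × 1024, both rounded by %.0f.
func formatResources(cores, memoryGB float64) string {
	return fmt.Sprintf("%.0f mc / %.0f MB", cores*1000, memoryGB*1024)
}

func main() {
	fmt.Println(formatResources(0.05, 0.08)) // 50 mc / 82 MB
	fmt.Println(formatResources(0.10, 0.16)) // 100 mc / 164 MB
}
```

Using 1024 MB per GB makes the displayed memory about 2.4% higher than the "80MB" figure in the constant's comment.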
--- a/services/cost/services/collector.go
+++ b/services/cost/services/collector.go
@@ -13,23 +13,29 @@ const (
 	CloudCPUPerCoreHour   = 0.30  // $/core/hour (Azure D-series + managed services + support)
 	CloudMemoryPerGBHour  = 0.08  // $/GB/hour (includes managed DB, Redis, caching layers)

-	// On-prem costs (amortized hardware only)
+	// On-prem/bare metal costs (amortized hardware only)
 	// Assumes: 3-year hardware amortization, minimal ops overhead
 	// Does NOT include: datacenter, power, cooling, staff, network
 	OnpremCPUPerCoreHour  = 0.02  // $/core/hour
 	OnpremMemoryPerGBHour = 0.005 // $/GB/hour

-	// Estimated resource usage per active VIN
-	// Connected vehicle telemetry pipeline: ingestion → Kafka → processing → storage → APIs
-	// Each active VIN requires dedicated processing capacity across the stack
-	EstimatedCPUPerVin    = 0.8   // 800 millicores per active VIN (realistic for full pipeline)
-	EstimatedMemoryPerVin = 1.2   // 1.2GB per active VIN (buffers, state, caches, connections)
+	// Per-VIN resource usage (marginal cost per vehicle)
+	// This is the incremental CPU/RAM needed for each additional connected vehicle
+	// Covers: telemetry ingestion, Kafka processing, storage writes, API queries
+	PerVinCPUCores  = 0.05  // 50 millicores per VIN (marginal)
+	PerVinMemoryGB  = 0.08  // 0.08 GB per VIN (marginal); displayed as ~82 MB in the report (GB × 1024)

-	// Base infrastructure cost per collection interval (shared services)
-	// Storage, Event Hubs, Defender, other fixed costs
-	// ~$45k/month fixed = ~$10/15min
-	BaseInfraCloudCost  = 10.00 // $/15min for shared infra (cloud)
-	BaseInfraOnpremCost = 0.90  // $/15min for shared infra (on-prem)
+	// Platform base resources (fixed cost regardless of VIN count)
+	// This is the minimum infrastructure to run the cloud services platform:
+	// Kafka brokers, ClickHouse, MongoDB, Redis, PostgreSQL, gateway services, etc.
+	PlatformBaseCPUCores  = 176.0  // 176 cores for base platform
+	PlatformBaseMemoryGB  = 896.0  // 896GB RAM for base platform
+
+	// Base infrastructure cost per collection interval (managed services, storage, networking)
+	// Cloud: Event Hubs, CosmosDB, Azure Storage, Defender, monitoring (~$45k/month)
+	// On-prem: ~75% cheaper - just hardware amortization for equivalent capacity
+	BaseInfraCloudCost  = 10.00 // $/15min for managed services (cloud)
+	BaseInfraOnpremCost = 2.50  // $/15min for equivalent on-prem (75% cheaper)
 )

 // CalculateCosts computes cloud and on-prem costs for given resource usage
@@ -43,6 +49,26 @@ func CalculateCosts(cpuCores, memoryGB float64, durationHours float64) (cloudCos
 	return
 }

+// CalculatePlatformBaseCosts returns the fixed platform infrastructure costs
+// This is the cost to run the platform regardless of VIN count
+func CalculatePlatformBaseCosts(durationHours float64) (cloudCost, onpremCost float64) {
+	cloudCost = (PlatformBaseCPUCores * CloudCPUPerCoreHour * durationHours) +
+		(PlatformBaseMemoryGB * CloudMemoryPerGBHour * durationHours)
+
+	onpremCost = (PlatformBaseCPUCores * OnpremCPUPerCoreHour * durationHours) +
+		(PlatformBaseMemoryGB * OnpremMemoryPerGBHour * durationHours)
+
+	return
+}
+
+// CalculatePerVinCosts returns the marginal cost for a single VIN
+func CalculatePerVinCosts(durationHours float64, activityMultiplier float64) (cpuCores, memoryGB, cloudCost, onpremCost float64) {
+	cpuCores = PerVinCPUCores * activityMultiplier
+	memoryGB = PerVinMemoryGB * activityMultiplier
+	cloudCost, onpremCost = CalculateCosts(cpuCores, memoryGB, durationHours)
+	return
+}
+
 // CalculateBaseCosts returns the shared infrastructure costs for a collection interval
 func CalculateBaseCosts() (cloudCost, onpremCost float64) {
 	return BaseInfraCloudCost, BaseInfraOnpremCost
@@ -79,12 +105,15 @@ func collectMetrics() {
 		return
 	}

-	// Convert to cost records
-	// Estimate resource usage based on message activity
 	durationHours := 0.25 // 15 minutes = 0.25 hours
 	records := make([]CostRecord, 0, len(activeVins))

-	// Get base infrastructure costs and distribute across active VINs
+	// Calculate platform base costs (distributed across all VINs for reporting)
+	platformCloudCost, platformOnpremCost := CalculatePlatformBaseCosts(durationHours)
+	platformCloudPerVin := platformCloudCost / float64(len(activeVins))
+	platformOnpremPerVin := platformOnpremCost / float64(len(activeVins))
+
+	// Get managed services base costs (also distributed)
 	baseCloud, baseOnprem := CalculateBaseCosts()
 	baseCloudPerVin := baseCloud / float64(len(activeVins))
 	baseOnpremPerVin := baseOnprem / float64(len(activeVins))
@@ -98,22 +127,20 @@ func collectMetrics() {
 			activityMultiplier = 1.5
 		}

-		cpuCores := EstimatedCPUPerVin * activityMultiplier
-		memoryGB := EstimatedMemoryPerVin * activityMultiplier
+		// Calculate per-VIN marginal costs
+		cpuCores, memoryGB, vinCloudCost, vinOnpremCost := CalculatePerVinCosts(durationHours, activityMultiplier)

-		cloudCost, onpremCost := CalculateCosts(cpuCores, memoryGB, durationHours)
-		
-		// Add share of base infrastructure costs
-		cloudCost += baseCloudPerVin
-		onpremCost += baseOnpremPerVin
+		// Total cost = per-VIN marginal + share of platform base + share of managed services
+		totalCloudCost := vinCloudCost + platformCloudPerVin + baseCloudPerVin
+		totalOnpremCost := vinOnpremCost + platformOnpremPerVin + baseOnpremPerVin

 		records = append(records, CostRecord{
 			VIN:           v.VIN,
 			Timestamp:     to,
 			CPUCores:      cpuCores,
 			MemoryGB:      memoryGB,
-			CloudCostUSD:  cloudCost,
-			OnpremCostUSD: onpremCost,
+			CloudCostUSD:  totalCloudCost,
+			OnpremCostUSD: totalOnpremCost,
 			PodCount:      1,
 		})
 	}
@@ -124,5 +151,6 @@ func collectMetrics() {
 		return
 	}

-	logger.Info().Msgf("Collected and stored cost metrics for %d VINs", len(records))
+	logger.Info().Msgf("Collected and stored cost metrics for %d VINs (platform base: cloud $%.2f, on-prem $%.2f)", 
+		len(records), platformCloudCost, platformOnpremCost)
 }
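
Reviewer sketch of the three-part cost formula for a single 15-minute interval, using the constants from this diff. It assumes every VIN is at the low activity tier (multiplier 1.0), and `intervalCost` is a hypothetical name, not a function in the service.

```go
package main

import "fmt"

const (
	cloudCPU, cloudMem          = 0.30, 0.08  // $/core-hour, $/GB-hour
	onpremCPU, onpremMem        = 0.02, 0.005 // $/core-hour, $/GB-hour
	baseCores, baseGB           = 176.0, 896.0
	vinCores, vinGB             = 0.05, 0.08
	managedCloud, managedOnprem = 10.00, 2.50 // $/15min
)

// intervalCost returns platform base + (per-VIN × VIN count) + managed
// services for one 15-minute collection interval at the given rates.
func intervalCost(vins int, cpuRate, memRate, managed float64) float64 {
	const hours = 0.25
	platform := (baseCores*cpuRate + baseGB*memRate) * hours
	perVin := (vinCores*cpuRate + vinGB*memRate) * hours
	return platform + perVin*float64(vins) + managed
}

func main() {
	cloud := intervalCost(1000, cloudCPU, cloudMem, managedCloud)
	onprem := intervalCost(1000, onpremCPU, onpremMem, managedOnprem)
	fmt.Printf("cloud $%.2f  on-prem $%.2f  savings %.1f%%\n",
		cloud, onprem, (cloud-onprem)/cloud*100)
}
```

With these inputs the fixed platform base and managed services dominate the total, so the savings percentage moves only slightly as the fleet grows.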
--- a/services/virtual-vehicle/README.md
+++ b/services/virtual-vehicle/README.md
@@ -45,3 +45,27 @@ This service simulates a connected vehicle (T.Rex) by:
 | 0x200 | Door status |
 | 0x201 | Light status |
 | 0x300 | HVAC |
+
+
+## Troubleshooting
+
+### 503 Error - Vault Certificate TTL Issue
+
+If the service returns a 503 and logs show a Vault error like:
+
+```json
+{"level":"error","message":"manufacturer API returned 503: {\"message\":\"Vault unable to create certificate, status code: 400 {\\\"errors\\\":[\\\"cannot satisfy request, as TTL would result in notAfter 2034-02-01T02:34:43.379661987Z that is beyond the expiration of the CA certificate at 2032-04-10T20:21:59Z\\\"]}\\n\",\"error\":\"Service Unavailable\"}"}
+```
+
+This happens when the certificate TTL requested (8 years) would extend beyond the CA certificate's expiration date. The manufacturer service requests a cert that expires in 2034, but the Vault CA expires in 2032.
+
+**Fix options:**
+
+1. **Reduce the PKI role TTL in Vault** *(non-disruptive - only affects new certs)*:
+   ```bash
+   vault write pki/roles/vehicle-cert ttl=17520h max_ttl=17520h  # 2 years
+   ```
+2. **Extend the CA certificate** - Regenerate the Vault PKI CA with a longer validity period *(disruptive - all existing certs need reissue)*
+3. **Rotate the CA** - Create a new CA with extended validity *(disruptive - requires cert reissue)*
+
+On dev-cluster, the CA was created with a ~10 year validity, but only about 6 years remain before it expires on 2032-04-10, so an 8-year cert TTL pushes past that.