Conclusion and future work

The project's bottom line, headline numbers in one place, and the directions we'd push next, including how a VLM could reshape this pipeline.

1. Summary

This project studies street parking presence inference from street-level imagery through a multi-cue pipeline built around parking signs, parking meters, curb structure, curb color, and segment-level aggregation.

The strongest single cue remains the supervised parking-sign detector trained on MTSD, with final validation performance of mAP@50 = 0.5487, mAP@50–95 = 0.3824, image-level F1 = 0.6673, and AUROC = 0.8310. These results show that sign detection is a useful baseline, but the threshold and qualitative analyses also show that single-image sign detection is limited by small-object scale, viewpoint, and sign appearance variation.

The parking-meter and curb experiments refine the role of auxiliary cues. The COCO-pretrained parking-meter detector transfers partially to Mapillary Vistas, but low precision and many pole-like false positives mean parking meters should not be treated as a primary standalone cue. Curb segmentation is feasible and recovers useful structure, but curb-color inference must be conservative because strong painted curb colors are sparse and color estimates are sensitive to mask contamination. Boundary-based color extraction and confidence-margin rules make the curb-color output more interpretable, but the resulting signal remains auxiliary.
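
To make the conservative curb-color rule concrete, here is a minimal sketch of boundary-based color extraction with a confidence margin, assuming a binary curb mask, OpenCV HSV conversion, and hypothetical hue ranges and margin values; the actual thresholds used in the experiments may differ.

```python
import cv2
import numpy as np

# Hypothetical hue ranges (OpenCV hue is 0-179); the pipeline's real values may differ.
CURB_HUE_RANGES = {
    "red": [(0, 10), (170, 179)],
    "yellow": [(20, 35)],
    "green": [(40, 85)],
    "blue": [(100, 130)],
}

def curb_color_with_margin(image_bgr, curb_mask, margin=0.25, min_saturation=80):
    """Estimate painted curb color from the mask's boundary pixels only.

    Returns (label, score), or ("none", 0.0) when no color wins by a clear
    margin, mirroring the conservative confidence-margin rule described above.
    """
    # Sample only a thin band near the mask boundary to limit contamination
    # from road-surface and sidewalk pixels inside the mask.
    eroded = cv2.erode(curb_mask.astype(np.uint8), np.ones((5, 5), np.uint8))
    boundary = curb_mask.astype(bool) & ~eroded.astype(bool)
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    pixels = hsv[boundary]
    # Keep only clearly saturated pixels; unpainted concrete is low-saturation.
    pixels = pixels[pixels[:, 1] >= min_saturation]
    if len(pixels) == 0:
        return "none", 0.0

    fractions = {}
    for label, ranges in CURB_HUE_RANGES.items():
        hit = np.zeros(len(pixels), dtype=bool)
        for lo, hi in ranges:
            hit |= (pixels[:, 0] >= lo) & (pixels[:, 0] <= hi)
        fractions[label] = hit.mean()

    ranked = sorted(fractions.items(), key=lambda kv: kv[1], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    # Require the winner to beat the runner-up by a margin, otherwise abstain.
    if best[1] - runner_up[1] < margin:
        return "none", 0.0
    return best[0], float(best[1])
```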

The final aggregation experiments provide the main systems-level result. On a controlled synthetic pseudo-segment benchmark with 225 five-image segments, segment-level aggregation improved F1 from 0.319 for a single-image per-segment baseline to 0.830. The gain is primarily driven by recall: the baseline misses most positive segments because a single selected view often does not contain the relevant cue, while aggregation can recover evidence from any of the five views. This directly supports the original motivation of the project: street parking evidence is sparse and viewpoint-dependent, so the segment should be inferred from multiple views rather than one image.
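
For concreteness, here is a minimal sketch of segment-level aggregation in this style, assuming per-view cue scores $(s_i, m_i, c_i)$ in $[0, 1]$ and the 1.0 / 0.6 / 0.4 reliability weights mentioned under Future work; the decision threshold shown is illustrative, not the tuned operating point.

```python
from typing import Iterable, Tuple

# Reliability weights for (sign, meter, curb) cues: the 1.0 / 0.6 / 0.4 split
# referenced under Future work, treated here as given.
CUE_WEIGHTS = (1.0, 0.6, 0.4)

def aggregate_segment(view_scores: Iterable[Tuple[float, float, float]],
                      weights: Tuple[float, float, float] = CUE_WEIGHTS,
                      threshold: float = 0.5) -> Tuple[bool, float]:
    """Pool per-image cue scores (s_i, m_i, c_i) over all views of a segment.

    Each view contributes its best weighted cue; the segment score is the max
    over views, so evidence visible in any single view can flip the decision.
    """
    segment_score = 0.0
    for s_i, m_i, c_i in view_scores:
        view_score = max(weights[0] * s_i, weights[1] * m_i, weights[2] * c_i)
        segment_score = max(segment_score, view_score)
    return segment_score >= threshold, segment_score

# Example: a five-view segment where only the fourth view shows a parking meter.
views = [(0.05, 0.0, 0.1), (0.02, 0.0, 0.0), (0.1, 0.0, 0.0),
         (0.0, 1.0, 0.0), (0.03, 0.0, 0.2)]
print(aggregate_segment(views))  # (True, 0.6): the meter-only view drives the decision
```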

The manually collected six-segment real-world dataset provides qualitative validation of the same idea. Aggregation correctly recovered five of six positive examples, including meter-only, curb-only, and mixed-cue segments that the single-image sign baseline missed. The one failure case is also informative: the visible sign was outside the detector's training distribution, showing that aggregation can compensate for sparse visibility but not for missing visual concepts.

2. Headline numbers

  • Sign mAP@50: 0.5487
  • Sign mAP@50–95: 0.3824
  • Sign image-level F1: 0.6673
  • Sign AUROC: 0.8310
  • Single-image baseline F1: 0.319
  • Aggregated F1: 0.830
  • F1 lift from aggregation: +0.51
  • Real-world segments recovered: 5/6

3. Key takeaways

The project demonstrates a complete pipeline from cue detection to segment-level inference. Parking-sign detection is a strong starting point, but robust street parking inference requires multi-view aggregation and auxiliary cues.

  • Parking signs are the strongest explicit cue, but single-image sign detection is fundamentally limited by small-object scale and viewpoint.
  • Parking meters transfer partially zero-shot — useful as an auxiliary cue but too noisy to stand alone.
  • Curb segmentation is feasible, but curb-color inference must be conservative; precision matters more than coverage.
  • Segment-level aggregation is the system-level win: weak cues become reliable when pooled across multiple views.
  • Aggregation has a hard limit — it cannot recover segments where every view is outside the detector's visual vocabulary (the seg_005 failure case).

4. Future work

Natural directions to extend this project, ordered roughly from "cheap, high-value" to "more open-ended":

Closer-to-real-world evaluation

  • Replace synthetic pseudo-segments with a larger georeferenced segment dataset, where five-view groupings reflect actual nearby street views rather than randomly sampled images.
  • Learn cue-fusion weights from validation data instead of hand-tuned reliability heuristics. The current 1.0 / 0.6 / 0.4 split is a reasonable starting point, but a small classifier on top of $(s_i, m_i, c_i)$ summary statistics would likely calibrate better (see the sketch after this list).
  • Incorporate road-side geometry — left/right side of the road, distance along the segment, camera heading — to disambiguate which curb each cue is associated with.
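
As a sketch of the learned-fusion idea from the second bullet: a small logistic-regression model over per-segment summaries of $(s_i, m_i, c_i)$, assuming scikit-learn and a max/mean feature summary; the feature choice is illustrative rather than part of the current pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def segment_features(view_scores):
    """Summarise a segment's per-view (s_i, m_i, c_i) scores into fixed-length features.

    Here: per-cue max and mean over views -- one of many reasonable choices.
    """
    arr = np.asarray(view_scores, dtype=float)  # shape (n_views, 3)
    return np.concatenate([arr.max(axis=0), arr.mean(axis=0)])

def fit_fusion_model(segments, labels):
    """Fit a small classifier on validation segments to replace hand-tuned weights.

    `segments` is a list of per-segment view-score lists and `labels` the 0/1
    segment ground truth; balanced class weights help with sparse positives.
    """
    X = np.stack([segment_features(s) for s in segments])
    y = np.asarray(labels)
    model = LogisticRegression(class_weight="balanced", max_iter=1000)
    model.fit(X, y)
    return model
```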

Stronger per-cue detectors

  • Multi-scale and tiled inference for the sign detector to address the resolution-sensitivity findings; the qualitative experiments suggest meaningful gains are available without a new model architecture (see the tiling sketch after this list).
  • Extend sign understanding toward OCR and rule interpretation, so the system can answer "is parking allowed here, now, for this vehicle?" rather than just "is there a parking sign?"
  • Expand curb supervision to include curb color directly (rather than inferring it from segmentation + HSV), which would substantially reduce the contamination problem.
  • Domain-tune the parking-meter detector with a small in-domain labeled set to lift the zero-shot precision.
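
A minimal sketch of the tiled-inference idea from the first bullet, with the detector abstracted as a `detect_fn` callable (a stand-in for the actual sign detector); tile size, overlap, and the NMS threshold are illustrative and would need tuning.

```python
import numpy as np

def _nms(boxes, scores, iou_thresh):
    """Greedy non-maximum suppression on (N, 4) xyxy boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(i)
        if len(order) == 1:
            break
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = rest[iou < iou_thresh]
    return keep

def tiled_detect(image, detect_fn, tile=1024, overlap=0.25, iou_thresh=0.5):
    """Run a detector over overlapping tiles and merge the results with NMS.

    `detect_fn(crop)` is assumed to return (boxes, scores) with boxes as xyxy
    arrays in crop coordinates. Tiling keeps small, distant signs at a
    resolvable scale instead of downscaling them with the full frame.
    (Edge tiles are not specially handled in this sketch.)
    """
    h, w = image.shape[:2]
    step = max(int(tile * (1 - overlap)), 1)
    boxes_all, scores_all = [], []
    for y0 in range(0, max(h - tile, 0) + 1, step):
        for x0 in range(0, max(w - tile, 0) + 1, step):
            crop = image[y0:y0 + tile, x0:x0 + tile]
            boxes, scores = detect_fn(crop)
            if len(boxes) == 0:
                continue
            # Shift tile-local boxes back into full-image coordinates.
            boxes_all.append(np.asarray(boxes, dtype=float) + [x0, y0, x0, y0])
            scores_all.append(np.asarray(scores, dtype=float))
    if not boxes_all:
        return np.empty((0, 4)), np.empty((0,))
    boxes = np.concatenate(boxes_all)
    scores = np.concatenate(scores_all)
    keep = _nms(boxes, scores, iou_thresh)
    return boxes[keep], scores[keep]
```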

Vision-Language Models (VLMs)

A particularly promising direction is to replace or augment parts of this pipeline with a Vision-Language Model. VLMs are well-suited to this problem for several reasons:

  • Compositional sign understanding. Parking signs are inherently text + symbol assemblies (e.g. "No parking 8 AM–6 PM Mon–Fri except permit"). A VLM can read the entire sign, including time restrictions and exception clauses, in one pass — replacing the current "detect + crop + OCR + parse" cascade with a single vision-grounded reading step.
  • Multi-cue reasoning in natural language. Today the aggregation rule is a fixed weighted-max over three numerical scores. A VLM could reason directly over a set of street-view images and produce structured output like "parking is permitted on the west side of this segment between 9 AM and 5 PM" — integrating signs, meters, and curb cues without a hand-built fusion rule.
  • Better out-of-distribution behavior. The seg_005 failure case is the canonical illustration: a non-standard parking sign that is outside the MTSD training distribution. A VLM with broad visual-text grounding is more likely to recognize the visible text and symbols even without that exact sign type having been seen during fine-tuning.
  • Auxiliary supervision generation. Even without putting a VLM in the inference loop, a strong VLM can be used offline to label or re-label street-view images for fine-grained parking attributes, mitigating the "no good public benchmark" problem flagged in the proposal.

Concrete VLM-driven extensions worth trying:

  1. VLM as a single-image cue extractor. Prompt a VLM with a structured query ("List any parking-related evidence visible in this image: parking signs with their text, parking meters, and curb color") and feed its parsed output into the existing aggregation rule. Compare against the current sign + meter + curb stack.
  2. VLM as a segment-level reasoner. Pass all five views of a segment to a multi-image-capable VLM and ask for a single segment-level decision plus a justification. Evaluate against the synthetic and manual benchmarks.
  3. Hybrid: VLM-assisted hard cases. Keep the current cheap detector stack, and only escalate to a VLM when the aggregated score falls in an ambiguous middle range. This keeps inference cost low while improving borderline-case precision (a minimal sketch follows this list).
  4. VLM-generated sign captions for OCR-free rule parsing. Use a VLM to produce a structured representation of each detected parking sign (e.g. JSON with days, times, restriction) that downstream rules can consume directly.
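
For the hybrid option (item 3), a minimal sketch of the escalation logic, assuming the existing aggregated segment score and a hypothetical `vlm_decision` wrapper around a multi-image VLM call; the ambiguity band is illustrative.

```python
from typing import Callable, List, Tuple

def decide_with_escalation(
    views: List[str],
    aggregated_score: float,
    vlm_decision: Callable[[List[str]], Tuple[bool, str]],
    low: float = 0.35,
    high: float = 0.65,
) -> Tuple[bool, str]:
    """Keep the cheap detector stack for clear cases; escalate ambiguous ones to a VLM.

    `views` holds the segment's image paths, `aggregated_score` is the output
    of the existing weighted-max aggregation, and `vlm_decision` is a
    hypothetical wrapper around a multi-image VLM call returning
    (label, justification). Only segments whose score falls in the
    [low, high) band pay the VLM cost.
    """
    if aggregated_score < low:
        return False, "detector stack: confidently negative"
    if aggregated_score >= high:
        return True, "detector stack: confidently positive"
    label, justification = vlm_decision(views)
    return label, f"VLM escalation: {justification}"
```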

A useful framing: the current system shows that multi-view aggregation beats single-view inference. VLMs offer a way to extend that win to multi-modal aggregation within a single view — reading sign text, recognizing meter shapes, and judging curb appearance jointly, rather than as three separate detectors stitched together by a hand-tuned rule.