1. Summary
This project studies street parking presence inference from street-level imagery through a multi-cue pipeline built around parking signs, parking meters, curb structure, curb color, and segment-level aggregation.
The strongest single cue remains the supervised parking-sign detector trained on MTSD, with final validation performance of mAP@50 = 0.5487, mAP@50–95 = 0.3824, image-level F1 = 0.6673, and AUROC = 0.8310. These results establish sign detection as a useful baseline, but the threshold and qualitative analyses also show that single-image sign detection is limited by small-object scale, viewpoint, and variation in sign appearance.
The parking-meter and curb experiments refine the role of auxiliary cues. The COCO-pretrained parking-meter detector transfers partially to Mapillary Vistas, but low precision and many pole-like false positives mean parking meters should not be treated as a primary standalone cue. Curb segmentation is feasible and recovers useful structure, but curb-color inference must be conservative because strong painted curb colors are sparse and color estimates are sensitive to mask contamination. Boundary-based color extraction and confidence-margin rules make the curb-color output more interpretable, but the resulting signal remains auxiliary.
The final aggregation experiments provide the main system-level result. On a controlled synthetic pseudo-segment benchmark of 225 five-image segments, segment-level aggregation improved F1 from 0.319 (single-image-per-segment baseline) to 0.830. The gain is driven primarily by recall: the baseline misses most positive segments because a single selected view often does not contain the relevant cue, whereas aggregation can recover evidence from any of the five views. This directly supports the original motivation of the project: street parking evidence is sparse and viewpoint-dependent, so the segment should be inferred from multiple views rather than a single image.
The manually collected six-segment real-world dataset provides qualitative validation of the same idea. Aggregation correctly recovered five of six positive examples, including meter-only, curb-only, and mixed-cue segments that the single-image sign baseline missed. The one failure case is also informative: the visible sign was outside the detector's training distribution, showing that aggregation can compensate for sparse visibility but not for missing visual concepts.
2. Headline numbers
- Parking-sign detector (MTSD, final validation): mAP@50 = 0.5487, mAP@50–95 = 0.3824, image-level F1 = 0.6673, AUROC = 0.8310.
- Segment-level aggregation (synthetic benchmark, 225 five-image segments): F1 = 0.830, vs. 0.319 for the single-image per-segment baseline.
- Manually collected real-world dataset (6 segments): 5 of 6 positive segments recovered by aggregation.
3. Key takeaways
The project demonstrates a complete pipeline from cue detection to segment-level inference. Parking-sign detection is a strong starting point, but robust street parking inference requires multi-view aggregation and auxiliary cues.
- Parking signs are the strongest explicit cue, but single-image sign detection is fundamentally limited by small-object scale and viewpoint.
- Parking meters transfer partially zero-shot — useful as an auxiliary cue but too noisy to stand alone.
- Curb segmentation is feasible, but curb-color inference must be conservative; precision matters more than coverage.
- Segment-level aggregation is the system-level win: weak cues become reliable when pooled across multiple views.
- Aggregation has a hard limit — it cannot recover segments where every view is outside the detector's visual vocabulary (the seg_005 failure case).
4. Future work
Natural directions to extend this project, ordered roughly from "cheap, high-value" to "more open-ended":
Closer-to-real-world evaluation
- Replace synthetic pseudo-segments with a larger georeferenced segment dataset, where five-view groupings reflect actual nearby street views rather than randomly sampled images.
- Learn cue-fusion weights from validation data instead of hand-tuned reliability heuristics. The current 1.0 / 0.6 / 0.4 split is a reasonable starting point, but a small classifier on top of $(s_i, m_i, c_i)$ summary statistics would likely calibrate better (see the sketch after this list).
- Incorporate road-side geometry — left/right side of the road, distance along the segment, camera heading — to disambiguate which curb each cue is associated with.
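To make the learned-fusion idea concrete, the sketch below contrasts one plausible reading of the hand-tuned weighted-max rule with a small logistic-regression fuser trained on validation segments. The per-view score layout, the per-cue max/mean summary features, and the use of scikit-learn are illustrative assumptions; only the 1.0 / 0.6 / 0.4 weights and the $(s_i, m_i, c_i)$ notation come from the description above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hand-tuned reliability weights for (sign, meter, curb), as described above.
HAND_WEIGHTS = np.array([1.0, 0.6, 0.4])

def hand_tuned_segment_score(per_view_scores: np.ndarray) -> float:
    """One plausible weighted-max rule (assumed form, for comparison only).

    per_view_scores: (n_views, 3) array of per-image cue scores (s_i, m_i, c_i).
    Takes the strongest view per cue, weights it, and keeps the best cue.
    """
    per_cue_max = per_view_scores.max(axis=0)
    return float((HAND_WEIGHTS * per_cue_max).max())

def segment_features(per_view_scores: np.ndarray) -> np.ndarray:
    """Simple summary statistics per cue: max and mean over the segment's views."""
    return np.concatenate([per_view_scores.max(axis=0), per_view_scores.mean(axis=0)])

def fit_fusion_classifier(segments, labels) -> LogisticRegression:
    """Learn the fusion weights from labeled validation segments instead of hand-tuning."""
    X = np.stack([segment_features(s) for s in segments])
    return LogisticRegression(max_iter=1000).fit(X, np.asarray(labels))
```

The learned coefficients then play the role of the reliability weights and can be inspected directly against the hand-tuned 1.0 / 0.6 / 0.4 values.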
Stronger per-cue detectors
- Multi-scale and tiled inference for the sign detector to address the resolution-sensitivity findings — the qualitative experiments suggest meaningful gains are available without a new model architecture (a tiled-inference sketch follows this list).
- Extend sign understanding toward OCR and rule interpretation, so the system can answer "is parking allowed here, now, for this vehicle?" rather than just "is there a parking sign?"
- Expand curb supervision to include curb color directly (rather than inferring it from segmentation + HSV), which would substantially reduce the contamination problem.
- Domain-tune the parking-meter detector with a small in-domain labeled set to lift the zero-shot precision.
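A minimal sketch of the tiled-inference idea, assuming the existing sign detector is wrapped as a `detect_fn(image) -> (boxes, scores)` callable with boxes in `[x1, y1, x2, y2]` pixel coordinates; the tile size, overlap, and the torchvision NMS merge step are illustrative choices, not part of the current pipeline.

```python
import numpy as np
import torch
from torchvision.ops import nms

def tiled_detect(image: np.ndarray, detect_fn, tile: int = 1024,
                 overlap: int = 128, iou_thr: float = 0.5):
    """Run a single-image detector over overlapping tiles so small signs keep
    their native resolution, then merge duplicate detections with NMS."""
    h, w = image.shape[:2]
    stride = tile - overlap
    boxes_all, scores_all = [], []
    for y0 in range(0, max(h - overlap, 1), stride):
        for x0 in range(0, max(w - overlap, 1), stride):
            boxes, scores = detect_fn(image[y0:y0 + tile, x0:x0 + tile])
            if len(boxes) == 0:
                continue
            boxes = np.asarray(boxes, dtype=np.float32)
            boxes[:, [0, 2]] += x0   # map tile coords back to full-image coords
            boxes[:, [1, 3]] += y0
            boxes_all.append(boxes)
            scores_all.append(np.asarray(scores, dtype=np.float32))
    if not boxes_all:
        return np.zeros((0, 4), np.float32), np.zeros((0,), np.float32)
    boxes = np.concatenate(boxes_all)
    scores = np.concatenate(scores_all)
    keep = nms(torch.from_numpy(boxes), torch.from_numpy(scores), iou_thr).numpy()
    return boxes[keep], scores[keep]
```

Running this alongside whole-image inference and keeping the union of detections would be a cheap way to test the resolution-sensitivity hypothesis before touching the model.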
Vision-Language Models (VLMs)
A particularly promising direction is to replace or augment parts of this pipeline with a Vision-Language Model. VLMs are well-suited to this problem for several reasons:
- Compositional sign understanding. Parking signs are inherently text + symbol assemblies (e.g. "No parking 8 AM–6 PM Mon–Fri except permit"). A VLM can read the entire sign, including time restrictions and exception clauses, in one pass — replacing a "detect + crop + OCR + parse" cascade with a single vision-grounded reading step.
- Multi-cue reasoning in natural language. Today the aggregation rule is a fixed weighted-max over three numerical scores. A VLM could reason directly over a set of street-view images and produce structured output like "parking is permitted on the west side of this segment between 9 AM and 5 PM" — integrating signs, meters, and curb cues without a hand-built fusion rule.
- Better out-of-distribution behavior. The seg_005 failure case is the canonical illustration: a non-standard parking sign that is outside the MTSD training distribution. A VLM with broad visual-text grounding is more likely to recognize the visible text and symbols even without that exact sign type having been seen during fine-tuning.
- Auxiliary supervision generation. Even without putting a VLM in the inference loop, a strong VLM can be used offline to label or re-label street-view images for fine-grained parking attributes, mitigating the "no good public benchmark" problem flagged in the proposal.
Concrete VLM-driven extensions worth trying:
- VLM as a single-image cue extractor. Prompt a VLM with a structured query ("List any parking-related evidence visible in this image: parking signs with their text, parking meters, and curb color") and feed its parsed output into the existing aggregation rule. Compare against the current sign + meter + curb stack.
- VLM as a segment-level reasoner. Pass all five views of a segment to a multi-image-capable VLM and ask for a single segment-level decision plus a justification. Evaluate against the synthetic and manual benchmarks.
- Hybrid: VLM-assisted hard cases. Keep the current cheap detector stack, and only escalate to a VLM when the aggregated score falls in an ambiguous middle range. This keeps inference cost low while improving the borderline-case precision (a sketch of this escalation follows the list).
- VLM-generated sign captions for OCR-free rule parsing. Use a VLM to produce a structured representation of each detected parking sign (e.g. JSON with days, times, restriction) that downstream rules can consume directly.
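As a sketch of the hybrid option, the snippet below keeps the existing detector stack and only calls a VLM when the aggregated score falls in a middle band. The `query_vlm` callable, the 0.3 / 0.7 thresholds, and the JSON answer schema are all hypothetical placeholders that would need to be adapted to a real model and tuned on the validation segments.

```python
import json

# Illustrative thresholds for the "ambiguous middle range"; tune on validation data.
LOW, HIGH = 0.3, 0.7

PROMPT = (
    "Do these street-view images show evidence of street parking "
    "(parking signs and their text, parking meters, painted curbs)? "
    'Answer as JSON: {"parking_present": true|false, "evidence": "<short description>"}'
)

def decide_segment(agg_score: float, views: list, query_vlm) -> bool:
    """Cheap detector stack first; escalate only borderline segments to a VLM.

    query_vlm is a placeholder callable taking (images, prompt) and returning a
    JSON string in the format requested by PROMPT.
    """
    if agg_score >= HIGH:
        return True          # detector stack is already confident
    if agg_score <= LOW:
        return False
    reply = json.loads(query_vlm(views, PROMPT))
    return bool(reply["parking_present"])
```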
A useful framing: the current system shows that multi-view aggregation beats single-view inference. VLMs offer a way to extend that win to multi-modal aggregation within a single view — reading sign text, recognizing meter shapes, and judging curb appearance jointly, rather than as three separate detectors stitched together by a hand-tuned rule.