Practical challenges and limitations

This page covers what got in the way during the experiments and what the results don't yet prove.

1. Infrastructure and storage constraints

The datasets and preprocessing artifacts are large: MTSD alone is 36 GB and Mapillary Vistas is 32.6 GB of images and annotations. Their size created repeated issues across local storage, Google Colab, and remote execution environments. In particular:

  • Google Colab storage limits made it difficult to keep the full dataset and processed outputs in one place.
  • We initially explored multiple environments, including a local setup and several cloud-based options.
  • We attempted to use CloudLab, but acquiring GPU instances was difficult. Even when a short A30 allocation was obtained, the combination of large data transfer and missing GPU drivers made the setup impractical.
  • Kaggle itself introduced interruptions: the sign training run crashed after epoch 24 and had to be resumed separately (resuming from the last checkpoint is sketched just after this list).
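
Resuming the interrupted run followed the standard Ultralytics checkpoint pattern; a minimal sketch, with the checkpoint path as a placeholder for wherever the Kaggle run saved its weights:

```python
from ultralytics import YOLO

# Load the last checkpoint written before the crash and continue that run.
# The path is a placeholder; on Kaggle it points at the interrupted run's
# working directory.
model = YOLO("runs/detect/train/weights/last.pt")
model.train(resume=True)
```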

2. Evaluation on local hardware

On a local Apple Silicon machine, the first attempt at full-image inference for the validation split was killed during prediction because the script tried to pass the entire validation set in a single batch. We fixed this by changing the evaluation code to use chunked prediction. This was an important engineering improvement because it made local validation feasible and reproducible.
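
The change itself is small: iterate over the validation images in bounded chunks instead of handing the full list to a single predict call. A minimal sketch of the pattern, assuming an Ultralytics YOLO model; the chunk size, checkpoint, and file layout are placeholders rather than the exact evaluation code:

```python
import glob

from ultralytics import YOLO

def predict_in_chunks(model, image_paths, chunk_size=16, imgsz=1280):
    """Run prediction over image_paths a fixed-size chunk at a time.

    Handing the whole validation split to predict() in one call exhausted
    memory locally, so each call only holds chunk_size images and their
    results.
    """
    results = []
    for start in range(0, len(image_paths), chunk_size):
        chunk = image_paths[start:start + chunk_size]
        results.extend(model.predict(chunk, imgsz=imgsz, verbose=False))
    return results

model = YOLO("yolo11x.pt")  # placeholder; the fine-tuned sign checkpoint in practice
val_images = sorted(glob.glob("val/images/*.jpg"))  # placeholder layout
predictions = predict_in_chunks(model, val_images)
```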

Parking-meter evaluation introduced additional difficulties. Full zero-shot evaluation with YOLO11x at high inference resolution was slow locally, and moving the evaluation to Kaggle required extra work because of GPU/runtime compatibility issues. We eventually ran the full validation experiment by restructuring the evaluation code into chunked batches on Kaggle.

3. Task difficulty

The dataset contains many negatives and, more importantly, extremely small positive objects. The hardest part of the sign-detection problem is not simple class imbalance, but small-object detection under varied viewpoint and scale. The qualitative experiments with zoom and inference resolution strongly confirmed this.
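
The zoom and resolution effect is easy to reproduce qualitatively: run the same image through the detector at two inference sizes and compare what comes back. A small sketch, assuming an Ultralytics YOLO model (the checkpoint, image path, and sizes are placeholders):

```python
from ultralytics import YOLO

model = YOLO("yolo11x.pt")   # placeholder checkpoint
image = "street_scene.jpg"   # placeholder image with small, distant signs

# Same image, two inference resolutions. Signs that are only a handful of
# pixels wide at the default size are often recovered once the input is larger.
for size in (640, 1920):
    result = model.predict(image, imgsz=size, verbose=False)[0]
    print(f"imgsz={size}: {len(result.boxes)} detections")
```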

The same issue reappears in the parking-meter experiments. Many zero-shot failures were caused by meters being too small, too far away, or visually confusable with generic poles and other narrow street furniture.

A related but distinct difficulty appeared in the curb experiments. Although curb annotations are much more abundant than parking-meter annotations, curb color is not directly supervised and is much harder to recover robustly than curb presence. The main challenge was not simply detecting curb pixels but ensuring those pixels actually corresponded to the curb itself rather than adjacent crosswalk paint, road markings, or asphalt. This made the curb module less of a pure segmentation problem and more of a joint segmentation-and-representation problem, where uncertainty handling became an essential part of the final design.
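
One way to make that uncertainty handling concrete is to have the color step look only at pixels inside the predicted curb mask and abstain when the evidence is weak. The sketch below is illustrative only; the thresholds, abstain rule, and function name are assumptions rather than the project's actual module:

```python
import numpy as np

def estimate_curb_color(image_rgb, curb_mask, min_pixels=500, max_std=40.0):
    """Return an (R, G, B) estimate for the curb, or None to abstain.

    image_rgb: H x W x 3 uint8 array; curb_mask: H x W boolean array from the
    segmentation step. The mask is only trusted when it is large enough and
    the colors inside it are consistent, so bleed into crosswalk paint or
    asphalt tends to trigger an abstention rather than a wrong color.
    """
    pixels = image_rgb[curb_mask]
    if pixels.shape[0] < min_pixels:
        return None  # too few curb pixels to estimate a color
    if pixels.std(axis=0).mean() > max_std:
        return None  # inconsistent colors inside the mask; abstain
    return tuple(int(c) for c in np.median(pixels, axis=0))
```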

4. Limitations of the aggregation evaluation

The segment-level aggregation experiments are useful but should be interpreted with the right scope.

The synthetic benchmark

The synthetic pseudo-segment benchmark tests the mathematical and systems behavior of aggregation under sparse cue visibility. It does not prove full geographic street-segment reasoning — the synthetic segments are assembled from images across datasets rather than from a road network. This was a deliberate design choice: the available public datasets provide image-level annotations for signs, meters, and curbs, but not clean, labeled street-segment parking ground truth.

The real-world six-segment dataset

The manually collected six-segment dataset addresses this limitation qualitatively by using real nearby views from actual street locations. It is intentionally small — manual collection is slow, and each example requires finding a suitable street, capturing multiple nearby views, ensuring the views correspond to the same local curbside context, and writing notes about the visible cue pattern. The manual examples are best understood as real-world validation cases, not a statistically significant benchmark.

Heuristic aggregation rule

The aggregation rule itself is heuristic. The weights for signs (1.0), meters (0.6), and curb color (0.4) are based on observed cue reliability rather than learned calibration. This is appropriate for the current project stage because the goal is to demonstrate the value of aggregation, but a future system should learn cue weights on a larger georeferenced validation set and explicitly model spatial consistency, road side, and distance along the street.
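
As a concrete illustration, a weighted vote of this kind fits in a few lines; the normalization over visible cues below is an illustrative reconstruction around the stated weights, not the exact rule used in the experiments:

```python
# Heuristic cue weights from the current rule.
CUE_WEIGHTS = {"sign": 1.0, "meter": 0.6, "curb_color": 0.4}

def aggregate_segment(cues):
    """Combine visible cue confidences into one segment-level score.

    cues maps cue name -> confidence in [0, 1] for cues actually observed
    somewhere along the segment; cues that were never visible are simply
    absent. Returns a score in [0, 1], or None when no cue was visible.
    """
    visible = {name: conf for name, conf in cues.items() if name in CUE_WEIGHTS}
    if not visible:
        return None
    total_weight = sum(CUE_WEIGHTS[name] for name in visible)
    weighted = sum(CUE_WEIGHTS[name] * conf for name, conf in visible.items())
    return weighted / total_weight

# Example: a segment where a sign and a painted curb were seen, but no meter.
print(aggregate_segment({"sign": 0.9, "curb_color": 0.5}))
```

Normalizing by the weights of the cues that were actually visible keeps a sign-only segment from being penalized simply because no meter or painted curb was in view.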


For the project's wrap-up summary and future-work directions, see the Conclusion page.