
What We're Learning About GPU Infrastructure for Visual Imaging ML in Research Settings


Research teams running visual imaging machine learning—whether analyzing kelp forest imagery, tracking wildlife populations, or processing satellite data—face a problem that's remarkably consistent across domains: the infrastructure they need to run experiments is either inaccessible, unpredictable, or both.

We're building Pandoro to address this, and in the process, we're learning a lot about what researchers actually need versus what the market typically offers. This piece is a reflection on what we've observed so far—including where we're still figuring things out.


The numbers are worse than we expected


A Brown University research team recently studied the feasibility of pre-training models with academic resources (Khandelwal et al., October 2024). As part of that work, they surveyed 50 researchers across 35 institutions. Three statistics stood out:

  • Two-thirds rated their satisfaction with computing resources at 3 out of 5 or below.
    Researchers cited long wait times and inadequate hardware for large-scale experiments as primary frustrations.
  • 85% of respondents had zero budget for cloud compute.
    The vast majority relied entirely on on-premises clusters rather than commercial cloud services like AWS or Google Cloud. The survey noted that institution-owned hardware was considered more cost-effective long-term, but less flexible.
  • Only 10% had access to current-generation GPUs.
    Most researchers work with mid-tier or older hardware (RTX 3090s, A6000s) rather than current-generation workstation cards like the RTX 6000 Pro that make demanding visual imaging ML workloads practical.

The typical researcher needs 1-8 GPUs for days to weeks at a time—resources most don't have reliable access to.

This matches what we hear in conversations. The problem isn't that researchers don't know cloud GPUs exist—it's that the economics and access models don't fit their reality.


The budget predictability problem


Usage-based cloud pricing creates a specific challenge for grant-funded research: you often can't know in advance how many iterations an experiment will require.

When training a model to identify species in underwater imagery, for example, you might need to experiment with different architectures, hyperparameters, and data augmentation approaches before finding something that works. Each iteration costs money. If your cloud bill varies by 3x depending on how many experiments you run, budgeting becomes guesswork.

This isn't a flaw in how researchers work—experimentation is how science happens. But hourly billing assumes you can predict compute needs in advance, which often isn't true for exploratory research.
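
To make the uncertainty concrete, here is a toy calculation. The hourly rate, run time, and iteration counts are made-up placeholders, not a quote from any provider:

    # Hypothetical numbers: how iteration-count uncertainty widens an hourly bill.
    HOURLY_RATE_USD = 2.50   # assumed on-demand price for one GPU
    HOURS_PER_RUN = 12       # assumed wall-clock time for one training run

    def campaign_cost(num_runs: int) -> float:
        """Total cost of one experiment campaign billed by the hour."""
        return num_runs * HOURS_PER_RUN * HOURLY_RATE_USD

    for runs in (10, 20, 30):
        print(f"{runs:>2} runs -> ${campaign_cost(runs):,.2f}")
    # 10 runs -> $300.00
    # 20 runs -> $600.00
    # 30 runs -> $900.00  (3x the optimistic estimate, for the same research question)

The arithmetic is trivial; the problem is that nobody knows up front whether the honest answer is 10 runs or 30.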

We observe this tension repeatedly: researchers who could benefit from GPU acceleration either avoid it entirely (running experiments on CPUs over weeks instead of hours) or limit their experimentation to stay within uncertain budgets.

The institutional access problem


Institutional GPU resources are unevenly distributed, and access—where it exists—typically requires navigating queues and approval processes. For many researchers, this means multi-day or multi-week waits during peak periods—particularly problematic when deadlines approach.

Some institutions mandate Windows environments while most ML frameworks are optimized for Linux. Others require IT approval processes that add weeks before hardware is available. These aren't universal problems, but they're common enough that researchers frequently cite institutional barriers alongside budget constraints.


The reproducibility problem


This is the barrier we find most interesting from a research integrity perspective.

Scientific publication increasingly requires reproducible computational environments. But as noted in Nature Methods (Heil et al., 2021), even containerized code does not "fully insulate the running environment from the underlying hardware. Authors expecting bit-for-bit reproducibility from their containers may find that GPU-accelerated code fails to yield identical results on other machines due to the presence of different hardware or drivers."

A separate analysis in AI Magazine (Semmelrock et al., 2025) identifies hardware differences as one of several key barriers to ML reproducibility, documenting how different software versions, hyperparameter settings, and hardware configurations contribute to irreproducible results.

The implication: if you train a model on abstracted cloud infrastructure where hardware specifications aren't disclosed, other researchers may struggle to reproduce your results. For published research, this matters.

Most cloud providers abstract hardware details behind instance types and virtualization layers. This simplifies infrastructure management but undermines the transparency that scientific publication requires.
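
One practical mitigation is to record the exact hardware and software stack alongside published results. Below is a minimal sketch of what that record might look like, assuming a PyTorch workflow; the field names and output file are our own choices, not a standard:

    # Capture the environment details that hardware-dependent results hinge on.
    import json
    import platform

    import torch

    def environment_record() -> dict:
        record = {
            "python": platform.python_version(),
            "torch": torch.__version__,
            "cuda_runtime": torch.version.cuda,        # None on CPU-only builds
            "cudnn": torch.backends.cudnn.version(),
        }
        if torch.cuda.is_available():
            record["gpu"] = torch.cuda.get_device_name(0)
            record["gpu_capability"] = torch.cuda.get_device_capability(0)
        return record

    # Reduce run-to-run variation on one machine; results can still differ
    # across GPUs and driver versions, which is exactly the point above.
    torch.manual_seed(0)
    torch.use_deterministic_algorithms(True, warn_only=True)

    with open("environment.json", "w") as f:
        json.dump(environment_record(), f, indent=2, default=str)

A record like this does not make results bit-for-bit reproducible, but it tells other researchers what they would need to match.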

What we're building and the tradeoffs we're navigating


Pandoro provides dedicated consumer GPU systems for researchers at fixed monthly pricing with full hardware transparency. The approach involves specific tradeoffs we're actively weighing:

Consumer vs. enterprise GPUs: We use consumer-grade hardware rather than enterprise H100 systems. Our hypothesis: for most visual imaging ML workflows—species identification, environmental monitoring, climate data analysis—consumer hardware is adequate at a fraction of the cost. This won't work for training large language models, but that's not our target use case.

Fixed vs. usage-based pricing: We charge monthly, regardless of how many experiments you run. This creates predictability for budget planning but means you pay the same whether you use the system heavily or lightly. For researchers who need to run many iterations, this is advantageous. For those with intermittent needs, it may not be (a rough break-even sketch follows this list of tradeoffs).

Dedicated vs. shared resources: Each researcher gets dedicated hardware rather than shared infrastructure. This simplifies the experience (no queues, no contention) but is less efficient than shared systems at scale. We think the simplicity matters more for our target users than maximum resource efficiency.

Hardware transparency vs. abstraction: We disclose complete hardware specifications—exact GPU model, RAM, storage, system configuration. This supports reproducibility documentation but means we can't dynamically migrate workloads between different hardware types the way abstracted systems can.
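
To make the fixed-versus-usage tradeoff concrete, here is the rough break-even arithmetic, with hypothetical prices standing in for real quotes:

    # Hypothetical prices: when does a flat monthly fee beat hourly billing?
    FIXED_MONTHLY_USD = 900.0   # assumed flat monthly price for a dedicated system
    HOURLY_RATE_USD = 2.50      # assumed on-demand rate for comparable hardware

    break_even_hours = FIXED_MONTHLY_USD / HOURLY_RATE_USD
    utilization = break_even_hours / (30 * 24)
    print(f"Fixed pricing wins past {break_even_hours:.0f} GPU-hours per month "
          f"(about {utilization:.0%} utilization).")
    # -> Fixed pricing wins past 360 GPU-hours per month (about 50% utilization).

Researchers training for days to weeks at a time, the pattern described earlier, tend to clear that bar; someone with occasional short jobs may not, which is why we frame this as a tradeoff rather than a universal win.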


Questions we're still figuring out


We don't have all the answers yet. Some things we're actively working through:

Support models for domain scientists: Our target users are researchers with deep domain expertise (e.g., marine biology, climate science, environmental monitoring) but often without ML engineering backgrounds. What level of technical support do they actually need? How do we balance being helpful with not creating dependency?

Scaling approach: Dedicated hardware per researcher works at small scale. As demand grows, how do we maintain the simplicity that makes the model work while building sustainable infrastructure?

Pricing that works for both sides: Grant budgets have specific constraints. Monthly pricing needs to be affordable enough to fit research budgets while sustainable enough to maintain quality infrastructure. We think we've found a reasonable balance, but we're early enough that this remains a hypothesis.

If this resonates


We're in the early stages of validating this approach with research teams doing visual imaging ML. If the barriers described here match your experience, or if you see gaps in our thinking, we'd welcome a conversation.


Building with GPU infrastructure challenges?

Pandoro provides dedicated GPU systems for visual imaging ML research—fixed monthly pricing, full hardware transparency, no queues. Currently accepting early access researchers.

Get in touch

Sources: