100 points by el_pa_b about 9 hours ago | 31 comments
mlhpdx 20 minutes ago |
tomnicholas1 about 1 hour ago |
Many of these scientific file formats (HDF5, netCDF, TIFF/COG, FITS, GRIB, JPEG and more) are essentially just contiguous multidimensional array (or "tensor") chunks embedded alongside metadata about what's in the chunks. Efficiently fetching these from object storage mostly comes down to efficiently fetching the metadata up front, so you know where the chunks you want are [1].
The data model of Zarr [2] generalizes this pattern pretty well, so that when backed by Icechunk [3], you can store a "datacube" of "virtual chunk references" that point at chunks anywhere inside the original files on S3.
This allows you to stream data out as fast as the S3 network connection allows [4], and then you're free to pull that directly, or build tile servers on top of it [5].
In the Pangeo project and at Earthmover we do all this for weather and climate science data. But the underlying OSS stack is domain-agnostic, so it works for all sorts of multidimensional array data, and VirtualiZarr [0] has a plugin system for parsing different scientific file formats.
I would love to see if someone could create a virtual Zarr store pointing at this WSI data!
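To make the "virtual chunk reference" idea concrete, here is a rough sketch of what such a reference boils down to: a file location plus a byte offset and length, which can be turned into an HTTP Range request. This is not the Zarr, Icechunk or VirtualiZarr API; `ChunkRef`, `range_header`, `coalesce`, the manifest layout and the bucket path are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical virtual chunk reference: where one chunk's bytes live."""
    path: str    # object key/URL of the original file (made-up example below)
    offset: int  # byte offset of the chunk inside that file
    length: int  # chunk size in bytes


def range_header(ref: ChunkRef) -> dict[str, str]:
    """Build the HTTP Range header for exactly this chunk (end byte is inclusive)."""
    return {"Range": f"bytes={ref.offset}-{ref.offset + ref.length - 1}"}


def coalesce(refs, max_gap=0):
    """Merge references in the same file whose byte ranges touch, overlap,
    or are separated by at most max_gap bytes, so fewer requests are needed."""
    merged = []
    for ref in sorted(refs, key=lambda r: (r.path, r.offset)):
        last = merged[-1] if merged else None
        if (last is not None and last.path == ref.path
                and ref.offset - (last.offset + last.length) <= max_gap):
            end = max(last.offset + last.length, ref.offset + ref.length)
            merged[-1] = ChunkRef(last.path, last.offset, end - last.offset)
        else:
            merged.append(ref)
    return merged


# Made-up manifest: chunk grid index -> location of that chunk's bytes.
manifest = {
    (0, 0): ChunkRef("s3://example-bucket/slide.tiff", 4096, 1000),
    (0, 1): ChunkRef("s3://example-bucket/slide.tiff", 5096, 1000),
    (1, 0): ChunkRef("s3://example-bucket/slide.tiff", 90000, 500),
}

for ref in coalesce(manifest.values()):
    print(ref.path, range_header(ref))
```

Once the manifest has been read, every chunk fetch is an independent ranged GET, so the reads can be issued in parallel without touching the file's own metadata again.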
[0]: https://virtualizarr.readthedocs.io/en/stable/
[1]: https://earthmover.io/blog/fundamentals-what-is-cloud-optimi...
[2]: https://earthmover.io/blog/what-is-zarr
[3]: https://earthmover.io/blog/icechunk-1-0-production-grade-clo...
[4]: https://earthmover.io/blog/i-o-maxing-tensors-in-the-cloud
rwmj about 6 hours ago |
Interesting guide to the Whole Slide Images (WSI) format. The surprising thing for me is that compression is used, and they note that it does not affect use in diagnostics.
Back in the day we used TIFF for a similar application (X-ray detector images).
matthberg about 7 hours ago |
tokyovigilante about 6 hours ago |
Sleaker about 2 hours ago |
Edit: Looks like this is a slight discrepancy between the HN title and the GitHub description.
yread about 2 hours ago |
lametti about 6 hours ago |
isuckatcoding about 1 hour ago |
invaderJ1m about 4 hours ago |
Was there a requirement to work with these formats directly without converting?
Nora23 about 6 hours ago |
andrewstuart about 3 hours ago |
tonyhart7 about 7 hours ago |
huflungdung about 2 hours ago |
[1] https://github.com/mlhpdx/seekable-s3-stream