Using Kart with S3 ------------------ Kart can import tile-based datasets directly from `Amazon S3 `_. The basic import command is the same as importing locally, some variation of: - ``kart import s3://some-bucket/path-to-laz-files/*.laz`` - ``kart import s3://some-bucket/path-to-tif-files/*.tif`` This will fetch the tiles and place them in the LFS cache. From this point on, it makes no difference that the tiles were originally fetched from S3 - they will be stored, pushed to a remote, or fetched from a remote as needed. This is in contrast to a "linked" tile-based dataset - explained below. Linked Datasets ~~~~~~~~~~~~~~~ For tile-based datasets where the original tiles are found on S3, Kart can reference the original tiles as the authoritative "master copy" of the tiles - this means there is no need for these tiles to be pushed and pulled between Kart repositories using the LFS protocol that would otherwise be used for transferring tiles. Instead, any Kart repo that needs the tiles will simply fetch them directly from the original source. This could be helpful for you if the following are true: * The original files will be hosted at their current location on S3 indefinitely. * The Kart repo and any clones of it will have read access to the tiles at their current location on S3. * You want to avoid duplicating the tiles to minimise hosted storage costs - you don't want them hosted both on S3 *and* the LFS server. In this case, you can add the ``--link`` option to the import command: ``kart import s3://some-bucket/path-to-tiles/*.[laz|tif] --link`` This creates a dataset where each tile in the dataset is linked to the original tile on S3 - it stores the S3 URL from which it was imported. Tiles with these URLs are not pushed to remotes or fetched from remotes like other tiles - they are always fetched from this URL, so there is no need to push them to any other remote. However, the metadata describing the dataset and the tiles is still pushed and fetched as in any other dataset. A user who clones a repository containing a linked dataset may not notice anything unusual. Ordinarily, the metadata would be fetched from the remote, then the tiles downloaded from the LFS server. For a linked dataset, the metadata is fetched from the remote as before, then the tiles are downloaded directly from their original location on S3. Either way, the user now has the relevant tiles in their working copy. No-checkout Option ^^^^^^^^^^^^^^^^^^ When importing a dataset, Kart generally checks out the newly imported dataset to the working copy immediately, but provides an option to skip this step. Ordinarily, skipping this step provides only limited benefits, since it only skips a local copy operation: it saves a bit of time and could save some disk space (depending on how the filesystem in question deals with duplicated data). However, it can be much more useful to do so when creating a linked dataset. This is because it allows for the creation of a linked dataset by extracting all the metadata of the tiles from S3, without actually downloading all the tiles to the local machine. Avoiding the download of a large dataset could save a lot of time and bandwidth, and associated costs from S3. To create a linked dataset without downloading the original data, use: ``kart import s3://some-bucket/path-to-tiles/*.[laz|tif] --link --no-checkout`` This dataset will not be checked out during the import operation, or any time later, until the user reverses their decision using ``kart checkout --dataset=PATH_TO_DATASET``. This configuration option only affects a single repository - if any user later clones the repository, the dataset will still be checked out as normal in their cloned repository, unless they too opt out. Note that for ``--no-checkout`` to work, the S3 objects referenced need to have SHA256 checksums attached, so that Kart can store the SHA256 hash without fetching the entire tile (see the "SHA256 hashes" section below). S3 Credentials ^^^^^^^^^^^^^^ Kart uses a standard AWS SDK to fetch data from S3. AWS credentials are loaded from the standard locations - AWS config files, environment variables, IAM roles, etc. If credentials are unnecessary and unavailable, the environment variable ``AWS_NO_SIGN_REQUEST`` should be set to 1. Editing Tiles ^^^^^^^^^^^^^ Currently, Kart does not write to S3 on the user's behalf for any reason. This means linked datasets cannot be edited by modifying the working copy - committing such changes is prevented. Users may opt to write the required changes to S3 themselves, at which point they can use the ``kart import --replace-existing --link`` command to create a new version of the linked dataset. However, when doing so, take care not to overwrite any of the original tiles in S3, since that would break the requirement that Kart can continue to access those files whenever older versions of the dataset are checked out. SHA256 hashes ~~~~~~~~~~~~~ Kart uses `Git LFS `_ pointer files to point to point-cloud or raster tiles - even when those tiles are found in S3, rather than on a Git LFS server. For more details, see the section on :doc:`Git LFS ` In order to create a linked dataset where every tile is backed by an object on S3, Kart needs to learn the SHA256 hash of each object in order to populate the pointer file. Currently, Kart does this by fetching the tiles and computing the hash S3 itself - or if ``--no-checkout`` is specified, by querying the SHA256 checksum from S3, which works as long as the S3 objects already have SHA256 checksums attached (which is not guaranteed). If you need to add SHA256 hashes to existing S3 objects, this Python snippet using `boto3 `_ could be a good starting point. It copies an object from `key` to the same `key`, overwriting itself, but adds a SHA256 hash as it does so. It **does not** preserve the object's ACL - if you have set an ACL on the object, you will need to set it again. .. code:: python import boto3 boto3.client("s3").copy_object( Bucket=bucket, Key=key, CopySource=dict(Bucket=bucket, Key=key), ChecksumAlgorithm="SHA256", )