I love thinking about abstraction. By separating the concerns of data access and business logic, we can change one without affecting the other. In many of the data engineering applications I’ve worked on, we’ve read, processed, and stored data in S3 in some way.

Typically I see lazy attempts at design like the one below - where we don’t hide S3 away at all.

public class S3Service {
  private final S3BucketRepository s3BucketRepository;

  public S3Service(S3BucketRepository s3BucketRepository) {
    this.s3BucketRepository = s3BucketRepository;
  }

  public List<S3Object> listS3Objects(String bucketName, String prefix) {
    S3Bucket bucket = s3BucketRepository.getBucketByName(bucketName);
    return bucket.listObjects(prefix);
  }
}

I mean come on! The client of S3Service now knows about S3, and our service class knows we are using S3! Maybe in a local environment we want to use the local file system - or maybe we want to use OCI, GCP, Dropbox, Google Drive, or store the raw binary in Redis/DynamoDB behind a cache for superduper fast retrieval. So we must abstract away the details of our storage layer behind an interface.

Define interfaces

Depending where you start - you might begin with abstraction at the service/business layer or at the data layer. I like to start at the data layer since it tends to expose the most details in the wrong places, and I want to keep it as contained as possible. Many data stores come with limitations, so designing this layer right is critical.

So how do we abstract away the S3 object? Anything we store in S3 can be considered an “object” but in most applications I’ve worked on we retrieve them as “files”.

A File object might look like

public class File {
    private final String name;
    private final long size;
    private final Date lastModified;

    public File(String name, long size, Date lastModified) {
        this.name = name;
        this.size = size;
        this.lastModified = lastModified;
    }
}

Data Interface (optional)

If we’re really worried about tight coupling, we could abstract this away as an interface as well and expose some key potential methods instead

public interface File {
  String getName();
  long getSize();
  Date getLastModified();
  boolean exists();
  void delete();
}

public class S3File implements File {
  private final String name;
  private final long size;
  private final Date lastModified;

  public S3File(String name, long size, Date lastModified) {
    this.name = name;
    this.size = size;
    this.lastModified = lastModified;
  }

  public String getName() { return name; }
  public long getSize() { return size; }
  public Date getLastModified() { return lastModified; }
  // implement exists() and delete() against the storage client
}
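To see the payoff of the interface, here’s a hypothetical LocalFile backed by java.nio - nothing about the contract forces S3. (The File interface is repeated so the sketch stands alone; the error handling is one reasonable choice, not the only one.)

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Date;

// The File interface from above, repeated so this sketch compiles on its own.
interface File {
  String getName();
  long getSize();
  Date getLastModified();
  boolean exists();
  void delete();
}

// Hypothetical local-filesystem implementation: a java.nio.file.Path
// satisfies the same contract as an S3 object.
class LocalFile implements File {
  private final Path path;

  public LocalFile(Path path) {
    this.path = path;
  }

  @Override
  public String getName() {
    return path.getFileName().toString();
  }

  @Override
  public long getSize() {
    try {
      return Files.size(path);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  @Override
  public Date getLastModified() {
    try {
      return new Date(Files.getLastModifiedTime(path).toMillis());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  @Override
  public boolean exists() {
    return Files.exists(path);
  }

  @Override
  public void delete() {
    try {
      Files.deleteIfExists(path);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```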

If you look closely you might notice one thing - in practice, two things identify your object in AWS S3

  1. Bucket Name - ex: s3://my-bucket
  2. Path - ex: some/path/to/file.json

which identifies your full path: s3://my-bucket/some/path/to/file.json

There are two options I think can work

  1. Add a baseDirectory field to the File class, which hides the bucket information and is used by our repository class.
  2. Make the bucket something specific to the repository implementation in question.

I’ll address option 2 in a further example.
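Option 1 might look like the sketch below. It’s hypothetical - the size and lastModified fields from earlier are dropped for brevity - but it shows how baseDirectory hides where the file lives, whether that’s an S3 bucket or a local root directory.

```java
// Sketch of option 1: the File carries a baseDirectory, so "s3://my-bucket"
// lives in data rather than being baked into the repository.
class File {
    private final String baseDirectory; // e.g. "s3://my-bucket"
    private final String name;          // e.g. "some/path/to/file.json"

    public File(String baseDirectory, String name) {
        this.baseDirectory = baseDirectory;
        this.name = name;
    }

    public String getName() {
        return name;
    }

    // The repository joins the two to address the underlying object.
    public String getFullPath() {
        return baseDirectory + "/" + name;
    }
}
```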

Repository Interface

Next we need something to retrieve the data. Bam! Now we don’t care what stores our files.

public interface FileRepository {
    List<File> listFiles(String folder);
    File getFile(String fileName);
    void deleteFile(String fileName);
}
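As a quick sanity check on the contract, here’s a hypothetical in-memory implementation - handy for unit tests, and proof that nothing in FileRepository leaks S3. (Minimal stand-ins for File and the interface are included so the sketch compiles alone.)

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal stand-ins so this sketch compiles on its own; in the article these
// are the File and FileRepository types defined above.
class File {
    private final String name;
    File(String name) { this.name = name; }
    String getName() { return name; }
}

interface FileRepository {
    List<File> listFiles(String folder);
    File getFile(String fileName);
    void deleteFile(String fileName);
}

// Hypothetical in-memory implementation: no storage backend at all,
// yet it satisfies the same contract the S3 version will.
class InMemoryFileRepository implements FileRepository {
    private final Map<String, File> files = new LinkedHashMap<>();

    public void putFile(File file) {
        files.put(file.getName(), file);
    }

    @Override
    public List<File> listFiles(String folder) {
        // Treat "folder" as a key prefix, mirroring how S3 treats prefixes.
        return files.keySet().stream()
                .filter(name -> name.startsWith(folder))
                .map(files::get)
                .collect(Collectors.toList());
    }

    @Override
    public File getFile(String fileName) {
        return files.get(fileName);
    }

    @Override
    public void deleteFile(String fileName) {
        files.remove(fileName);
    }
}
```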

Using our new interfaces

Now our Service layer can be used like such

public class FileService {
  private final FileRepository fileRepository;

  public FileService(FileRepository fileRepository) {
    this.fileRepository = fileRepository;
  }

  public List<File> listFiles(String folder) {
    return fileRepository.listFiles(folder);
  }
}

Cool - so we have defined the interfaces. Let’s implement the remaining repository.

Repository Implementation

Here’s an example of option 2, where we make the bucket a parameter of S3FileRepository.

public class S3FileRepository implements FileRepository {
    private final AmazonS3 s3Client;
    private final String bucketName;

    public S3FileRepository(AmazonS3 s3Client, String bucketName) {
        this.s3Client = s3Client;
        this.bucketName = bucketName;
    }

    public List<File> listFiles(String folder) {
        ListObjectsV2Result result = s3Client.listObjectsV2(bucketName, folder);
        return result.getObjectSummaries().stream()
                .map(summary -> new File(summary.getKey(), summary.getSize(), summary.getLastModified()))
                .collect(Collectors.toList());
    }
    // implement other methods
}


Handling lots of files

Most of the time you’ve got LOTS of files, right? Maybe you need to iterate over a large subset to do something. We can use the Iterator pattern for this. By abstracting away the details of the S3 client and bucket/prefix, any client code that needs to walk over S3 files can do so through a generic iterator.

Perhaps the repository should be responsible for creating the iterator, or it might be better to shift that responsibility to a separate class such as a service or factory. That would further separate concerns and keep each class to a single responsibility.

public interface S3FileIterator extends Iterator<File> {
    // Empty interface for defining common behavior of iterators over S3 files
}

public class S3FileIteratorImpl implements S3FileIterator {
    private final AmazonS3 s3Client;
    private final ListObjectsV2Request request;
    private List<S3ObjectSummary> objectSummaries;
    private boolean lastPage;
    private int current;

    public S3FileIteratorImpl(AmazonS3 s3Client, String bucketName, String prefix) {
        this.s3Client = s3Client;
        this.request = new ListObjectsV2Request()
                .withBucketName(bucketName)
                .withPrefix(prefix);
        this.lastPage = false;
        this.current = 0;
    }

    @Override
    public boolean hasNext() {
        if (!lastPage && (objectSummaries == null || current >= objectSummaries.size())) {
            objectSummaries = getNextObjectSummaries();
            current = 0;
        }
        return current < objectSummaries.size();
    }

    @Override
    public File next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        S3ObjectSummary summary = objectSummaries.get(current);
        current++;
        // The summary already carries everything our File needs;
        // no extra GetObject call is required.
        return new File(summary.getKey(), summary.getSize(), summary.getLastModified());
    }

    // Fetch the next page, carrying S3's continuation token forward so we
    // don't re-read the first page forever.
    private List<S3ObjectSummary> getNextObjectSummaries() {
        ListObjectsV2Result result = s3Client.listObjectsV2(request);
        request.setContinuationToken(result.getNextContinuationToken());
        lastPage = !result.isTruncated();
        return result.getObjectSummaries();
    }
}

We need to be very careful if we use the above in any multithreaded/concurrent application. Whoever creates iterators should create one per thread, and the repository should not hold onto any of this state.
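The paging logic itself isn’t S3-specific. Here’s the same pattern as a storage-agnostic sketch - PagingIterator and PageSource are hypothetical names, not AWS API - where a page index plays the role of S3’s continuation token.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Generic lazy paging iterator: pages are fetched one at a time, and the
// page index acts like a continuation token carried between fetches.
class PagingIterator<T> implements Iterator<T> {
    interface PageSource<T> {
        // Returns the page at the given index; an empty list means exhausted.
        List<T> fetchPage(int pageIndex);
    }

    private final PageSource<T> source;
    private List<T> page = null;
    private int pageIndex = 0;
    private int current = 0;
    private boolean exhausted = false;

    PagingIterator(PageSource<T> source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        if (!exhausted && (page == null || current >= page.size())) {
            page = source.fetchPage(pageIndex++);
            current = 0;
            exhausted = page.isEmpty();
        }
        return !exhausted && current < page.size();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return page.get(current++);
    }
}
```

The same caveat as above applies: each thread should get its own iterator instance, since page and current are mutable state.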

Further Improvements

Some other TODOs I’d like to explore that would help solidify this design, so that we reach for it over the raw S3Client

  1. How might we do interesting things unique to S3 - like generating and sharing a presigned URL?
  2. How to further abstract this to handle S3 versioning?