
Building a Text File Database Part 1: Exploration

I recently wanted to start keeping track of what I was reading. I don't know why, but instead of using a spreadsheet I put them in a txt file:

	id     |start     |finish    |author                          |title
	1      |2023-08-01|2023-09-04|Mark Z. Danielewski             |House Of Leaves
	2      |2023-09-04|2023-09-23|Stephen King                    |On Writing
	3      |2023-09-24|          |Andrew Weir                     |The Martian
	4      |          |          |Stephen King                    |The Running Man

I think it's pretty self-explanatory; I've completed House of Leaves and On Writing, The Martian has been started but I'm not done yet, and I have not yet begun The Running Man.

As the list got a little longer I thought it would be fun to create a small app and exercise those parts of software development that don't get worked out much in the corporate world. The app would be simple: just a place to view my little list, add books to it, and click start and finish when I'm starting and finishing a book, respectively. My first instinct was to not fuss around with data storage: use a solution like the H2 database, import my list into it, and be done. But as I poked around in H2 and similar options I kept looking back at my nice little list and, well, I just wanted to parse it myself. It's already in an easily parsable format, and I added ids for some reason.

Whenever I've had to deal with files I've always taken the "read the entire file into memory / write the entire file at once" approach, but I've always wanted to deal with files more seriously: read only the data that's needed and write only the data that needs to be written, even though "having too much data to read into memory" is never going to be a realistic constraint on this project. After poking around to see what my options were in the Java ecosystem, I landed on Java NIO, which has a rich API for reading from and writing to files asynchronously.

The first thing I would need to do was read my existing file and represent it in some way. This was pretty straightforward:

    AsynchronousFileChannel fileChannel = AsynchronousFileChannel.open(file, WRITE, READ);
    ByteBuffer buffer = ByteBuffer.allocate(1024);
    Future<Integer> op = fileChannel.read(buffer, 0);
    op.get();
    String dataFromFile = new String(buffer.array());
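
For context, these snippets assume the usual NIO imports, with WRITE and READ statically imported from StandardOpenOption, roughly like so (the path here is just a placeholder for wherever the list lives):

    import java.nio.ByteBuffer;
    import java.nio.channels.AsynchronousFileChannel;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.concurrent.Future;

    import static java.nio.file.StandardOpenOption.READ;
    import static java.nio.file.StandardOpenOption.WRITE;

    // Placeholder path; the real file lives wherever the reading list is kept
    Path file = Paths.get("books.txt");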

Provided my first row isn't larger than 1024 bytes, I'd only need to find the end-of-line and I'd have my first record (which is the column definition):

    String dataFromFile = new String(buffer.array());
    int i = dataFromFile.indexOf(System.lineSeparator());
    String firstRecord = dataFromFile.substring(0, i);

From there I could split on the | symbol.
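
Something like this, with the padding trimmed off each name:

    // Split the header record on `|` and trim the padding from each column name
    String[] columnNames = firstRecord.split("\\|");
    for (String name : columnNames) {
        System.out.println(name.trim()); // id, start, finish, author, title
    }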

If I wanted to read row by row (not sure it's the most efficient approach, but it's the assumption I'm starting with) I'd need to allocate a buffer big enough to read in a whole row. This is where I made my first observation toward a design: there would need to be an upper limit placed on total record size. Sure, I could read in larger chunks, or check and continue reading until I found the EOL, but I had to set a boundary somewhere lest I get bogged down in implementation details that may never come to pass (accounting for a line that's too big for memory, etc.).
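
For illustration, the "keep reading until I find the EOL" alternative I decided against would look roughly like this (the chunk and record-size limits are arbitrary):

    // Rough sketch: read fixed-size chunks starting at `position` until an
    // end-of-line shows up, bailing out if the record grows past a maximum size.
    int chunkSize = 256;
    int maxRecordSize = 4096;
    long position = 0;
    StringBuilder record = new StringBuilder();
    while (record.indexOf(System.lineSeparator()) < 0) {
        if (record.length() > maxRecordSize) {
            throw new IllegalStateException("Record exceeds " + maxRecordSize + " bytes");
        }
        ByteBuffer chunk = ByteBuffer.allocate(chunkSize);
        int bytesRead = fileChannel.read(chunk, position).get();
        if (bytesRead <= 0) {
            break; // hit end of file without finding an EOL
        }
        record.append(new String(chunk.array(), 0, bytesRead));
        position += bytesRead;
    }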

Feeling satisfied with how reading would go, I turned my attention to what writing would look like. It was similarly simple:

    ByteBuffer buffer = ByteBuffer.allocate(1024);
    buffer.put("Some Data".getBytes());
    buffer.flip();
    Future<Integer> op = fileChannel.write(buffer, fileChannel.size());
    op.get();

This would append data to the end of the file, starting at position fileChannel.size(). If that were swapped out for 0 then "Some Data" would overwrite the first 9 characters at the beginning of the file. I now had a mechanism to append data to the file and to overwrite a field when updating. But what if I wanted to insert a row in the middle and "push down" the other rows? How would I handle a case where updating the title (at the end of the line, not a fixed size) resulted in a longer field? A shorter one? What if I wanted to add a review at the end, which would be an unknown size? After some more exploring this turned out to be naivete on my part about how files and file systems work in general: files can have data appended or overwritten, but data cannot be inserted in the middle. I think I knew this somewhere in the back of my mind but had never had to work with it concretely. The limitation could be overcome by a mechanism that reads in records/chunks starting at the end of the file, writes them back one record further down, and works its way up until there's room to insert the new record. I would also probably want a mechanism to make sure I wasn't accidentally overwriting data I didn't mean to, and to handle exceptions gracefully that would otherwise leave the file in a corrupted state, losing data. My TODO list was growing, and so here I made two more design decisions:

  1. New records would only be appended to the file
  2. Lines would be a fixed length; the last field would be padded with spaces when necessary
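
To make the second decision concrete, appending a new record would go something like this (the field widths and the book are just made-up examples):

    // Pad each field to its column width so every record is the same length,
    // then append the new record at the end of the file.
    String record = String.format("%-7s|%-10s|%-10s|%-32s|%-20s%n",
            "5", "", "", "Ursula K. Le Guin", "The Dispossessed");
    byte[] bytes = record.getBytes();
    ByteBuffer buffer = ByteBuffer.allocate(bytes.length);
    buffer.put(bytes);
    buffer.flip();
    fileChannel.write(buffer, fileChannel.size()).get();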

It was starting to come into focus why traditional databases have the constraints that they do. But having these constraints opened up room to explore the design and provided some welcome accidents later on. For starters, only the first row, the column definition, would need the larger buffer and the search for the EOL. After splitting it on |, I would know each field's width by counting the characters in the header field, padding included. After that I would know the exact size of every record and could allocate a buffer of exactly that size, with no need to check for EOL.

    public BookRepositoryFileImpl() throws IOException, ExecutionException, InterruptedException {
        fileChannel = AsynchronousFileChannel.open(Paths.get("/Users/josh/Desktop/projects/bookDb/bookTest.db"), WRITE, READ);

        ByteBuffer buffer = ByteBuffer.allocate(1024);
        Future<Integer> op = fileChannel.read(buffer, 0);
        op.get();
        String s = new String(buffer.array());
        recordLength = s.indexOf(System.lineSeparator()) + 1;
        fileSizeWithoutColumnDefinitions = fileChannel.size() - recordLength;

        // Split only the header record itself, not the whole buffer
        String[] cols = s.substring(0, recordLength).split("\\|");
        for (int i = 0; i < cols.length; i++) {
            columns.put(cols[i].trim(), i);
        }
    }

Constraining myself to a fixed record length provided a neat way to get the total number of records:

    public long total() {
        return fileSizeWithoutColumnDefinitions / recordLength;
    }

I quickly found a way to use a Stream to scan over the records and apply a "criteria" that skips further mapping work for records that don't match.
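
I haven't shown the Criteria class; a minimal sketch consistent with how it's used below (the field and builder names are inferred from the usage, so treat them as assumptions) might look like:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Minimal sketch of a Criteria holder, inferred from how it's used below;
    // the real class isn't shown in this post.
    public class Criteria {
        // Column name -> expected value pairs that must all match exactly
        final List<Map.Entry<String, String>> equals = new ArrayList<>();
        // Column names that must not be empty
        final List<String> columnsNotEmpty = new ArrayList<>();

        public Criteria andEquals(String column, String value) {
            equals.add(Map.entry(column, value));
            return this;
        }

        public Criteria andColumnNotEmpty(String column) {
            columnsNotEmpty.add(column);
            return this;
        }
    }

With something like that in place, here's the scan itself: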

    private Stream<Book> streamAll(Criteria criteria) {
        ByteBuffer buffer = ByteBuffer.allocate(recordLength);

        return Stream.iterate(
                        // Begin position at recordLength, the end of the column definition
                        // While position is less than the file size
                        // Increment position by recordLength; move to next record
                        recordLength,
                        n -> {
                            try {
                                return n < fileChannel.size();
                            } catch (IOException e) {
                                throw new RuntimeException(e);
                            }
                        },
                        n -> n + recordLength)
                .map(readFromPosition -> {
                    // Read next record from readFromPosition into a buffer
                    buffer.clear();
                    Future<Integer> op = fileChannel.read(buffer, readFromPosition);
                    try {
                        op.get();
                    } catch (InterruptedException | ExecutionException e) {
                        throw new RuntimeException(e);
                    }
                    return buffer;
                })
                .map(b -> {
                    // Transform buffer into a String and split on `|` to put
                    // the fields into an array, trimming each entry of extra padding
                    String s = new String(b.array());

                    return Arrays.stream(s.split("\\|"))
                            .map(String::trim)
                            .toList();
                })
                .filter(cols -> {
                    // Every equality criterion must match exactly
                    // (allMatch returns true when there are no criteria)
                    return criteria.equals
                            .stream()
                            .allMatch(eq -> cols.get(columns.get(eq.getKey())).equals(eq.getValue()));
                })
                .filter(cols -> {
                    // Every column checked for emptiness must actually have a value
                    return criteria
                            .columnsNotEmpty
                            .stream()
                            .allMatch(columnName -> !cols.get(columns.get(columnName)).isEmpty());
                })
                // Map to domain object
                .map(cols -> new Book(
                        cols.get(columns.get("id")),
                        cols.get(columns.get("title")),
                        cols.get(columns.get("author"))));
    }

This allowed for a nice, clean API. For example, here's how I counted the books I've finished:

    public long finished() {
        return streamAll(new Criteria().andColumnNotEmpty("finish"))
                .count();
    }
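
Other queries compose the same way; for instance, a count of the books I've at least started would just swap the column name:

    public long started() {
        return streamAll(new Criteria().andColumnNotEmpty("start"))
                .count();
    }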

It's not "efficient" in that it needs to look at every row in some capacity to decide whether to count it, but it works well enough for my needs. Maybe I'll explore indexing in a later installment. For better or worse, I next turned my efforts toward separating file access from domain logic.