File upload and disk quota

This document covers the details of how to handle file uploads in BASE. A discussion about disk quota is also included.

Contents
  1. Files and directories
  2. Secondary storage
  3. Disk quota
See also

Last updated: $Date: 2009-04-06 14:52:39 +0200 (Mon, 06 Apr 2009) $

1. Files and directories

1.1 Files

  1. BASE should be able to store files related to experiments and other items.
  2. A file may have a type, for example, raw data, protocol, reporter list, etc. The type only gives client applications a better way to filter files; it does not stop a user from using a file wherever it can be used.
  3. BASE should keep track of whether a file has been used or not. By "used" we mean that it has been linked to another item, for example a protocol.
  4. It should be possible to use a file multiple times.
  5. A user should be able to delete the "physical" file from disk, but the information about the file should still remain in the database.
  6. A deleted file can be re-uploaded in case it is needed again.
  7. BASE may rename an uploaded file to avoid overwriting an existing one.
  8. The core should calculate and store a unique value, for example the MD5 sum, for each file. This value is used to warn a user who is re-uploading a file. The user is not prevented from uploading, since it is possible that errors have been corrected.
  9. Files brought back from secondary storage should, however, be checked for a valid MD5 value.
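
The checksum handling in items 8 and 9 could be sketched roughly as follows. This is only an illustration in Python, not BASE code: the function names are hypothetical, and it assumes that the re-upload warning is raised when the new content differs from the recorded MD5 value, which the requirements do not state explicitly.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=64 * 1024):
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def reupload_warning(stored_md5, new_stream):
    """Return a warning text if the re-uploaded content differs, else None.

    Assumption: a warning is wanted when the checksum has changed.
    The upload itself is never rejected here.
    """
    if md5_of_stream(new_stream) != stored_md5:
        return "content differs from the originally uploaded file"
    return None


def restored_file_is_valid(stored_md5, restored_stream):
    """Files coming back from secondary storage must match the stored MD5."""
    return md5_of_stream(restored_stream) == stored_md5


original = md5_of_stream(io.BytesIO(b"raw data"))
print(reupload_warning(original, io.BytesIO(b"raw data, corrected")))
print(restored_file_is_valid(original, io.BytesIO(b"raw data")))
```

Reading in chunks keeps memory use constant, which matters for the large image and raw data files mentioned in the secondary storage section.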

1.2 Directories

  1. It should be possible to create a directory structure. The directory structure doesn't have to be physically represented on the disk.
  2. Each directory may contain multiple files, but a single file can only appear inside one directory.
  3. The directory structure may not limit how a file can be used, but is only used as a means for users to organise their files.
  4. A client application may ignore the directory structure and display all files as if they were part of the root directory.
  5. It is not possible to delete directories that contain files. [NOTE] Alternatively, if a directory that contains files is deleted the files are moved to the root directory, but this is a client issue and not a core issue.
  6. Some client applications, for example an FTP client, require that a file can be uniquely identified by name. This implies that all file names within a directory must be unique.
  7. [QUESTION] How do we handle sharing of files with users and groups? Should we require that all parent directories must also be shared? Or do we magically "create" a parallel directory structure like: /shared/johan, /shared/nicklas
    [ANSWER] This is mainly a client issue. However, the core must allow a client to traverse the path leading to the file.
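
A minimal sketch of a logical directory structure with unique file names is shown below. It is written in Python for illustration only; the class, the method names and the "name(1).ext" renaming scheme are assumptions, since the requirements only say that BASE may rename a file and that names in a directory must be unique.

```python
import posixpath


def unique_name(taken, wanted):
    """Return `wanted`, or a variant such as 'raw(1).txt' not present in `taken`."""
    if wanted not in taken:
        return wanted
    stem, ext = posixpath.splitext(wanted)
    n = 1
    while f"{stem}({n}){ext}" in taken:
        n += 1
    return f"{stem}({n}){ext}"


class Directory:
    """A logical directory; nothing here has to exist on the physical disk."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.files = {}  # file name -> file record

    def add_file(self, wanted_name, record):
        """Store the record under a name that is unique in this directory."""
        stored_name = unique_name(self.files, wanted_name)
        self.files[stored_name] = record
        return stored_name

    def path(self):
        """Walk up to the root, as a client must be able to do."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/" + "/".join(p for p in reversed(parts) if p)

    def can_delete(self):
        """Directories that contain files may not be deleted."""
        return not self.files


root = Directory("")
home = Directory("johan", parent=root)
print(home.add_file("raw.txt", object()))
print(home.add_file("raw.txt", object()))
print(home.path(), home.can_delete())
```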

2. Secondary storage

  1. BASE can be configured to support a secondary storage, where files that are rarely used can be placed, for example on tape-backup.
    [NOTE] The secondary storage is intended to be used for large files that are not regularly used once they have been parsed after the upload, for example images and raw data files. Such files may be moved to cheaper long-term storage.
  2. A user may flag that a file should be moved to the secondary storage. Information about the file should remain in the database.
  3. A user may flag that a file placed in the secondary storage should be retrieved and placed in the primary storage again.
  4. The BASE core will only handle the flagging of files to be moved. It is the responsibility of an external application to actually move the files between the primary and secondary storage.
  5. The external application should check at regular intervals, for example once every night, whether any files need to be moved.
  6. A file that is placed in secondary storage can also be flagged to be deleted.
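
The division of work between the core (flagging) and the external application (moving) might look like the sketch below. It is a Python illustration under stated assumptions: `Location`, `FileRecord`, `flag` and `process_pending` are hypothetical names, and the `transfer` callback stands in for whatever the external application does to physically move or delete a file.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Location(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"
    DELETED = "deleted"


@dataclass
class FileRecord:
    name: str
    location: Location = Location.PRIMARY
    requested: Optional[Location] = None  # set by the core, cleared by the mover


def flag(record, target):
    """Core side: only record the request; the file itself is not touched."""
    if record.location is Location.DELETED:
        raise ValueError("a deleted file must be re-uploaded, not moved")
    record.requested = None if target is record.location else target


def process_pending(records, transfer):
    """External side: run periodically (for example nightly) over all records."""
    handled = 0
    for record in records:
        if record.requested is None:
            continue
        transfer(record, record.requested)  # physical move or delete
        record.location = record.requested
        record.requested = None
        handled += 1
    return handled


scan = FileRecord("scan.tif")
flag(scan, Location.SECONDARY)
print(scan.location, scan.requested)  # still in primary storage, move pending
process_pending([scan], lambda rec, target: None)
print(scan.location, scan.requested)
```

Keeping the request in a separate field means the database always reflects where the file actually is, even between two runs of the external application.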

3. Disk quota

  1. A user must be assigned a disk quota that may not be exceeded.
  2. The quota is checked at the beginning of an operation, i.e. before uploading a file. If the check is successful the operation is allowed to proceed, even if the quota is exceeded after the operation. [NOTE] This is because a plugin that has run for several hours should not be rejected while saving its result.
  3. The quota applies to uploaded files, and other data that takes a lot of disk space. What counts as "other data" and "a lot of disk space" is decided for each case and should not matter to the quota system.
  4. Quota values may be specified as a total sum, or with values for each type of data or file.
  5. It should be possible to have independent quota settings for primary and secondary storage.
  6. Files that have been deleted should not be counted.
  7. [IMPLEMENTATION NOTE] Checking against quota values is something that is done fairly often. Used disk space should not have to be recalculated each time. A cache holding the most recent values should be considered.
  8. A group may also be assigned quota values.
  9. A user may be configured to use the quota from one of the groups where the user is a member. Then, both the user's individual quota and the group's quota are checked.
  10. The amount of disk space used should be stored per user, item and type. It will then be possible to generate reports on disk usage for groups and projects as well.
  11. [QUESTION] How do we handle removing a user from a group from which the user has used quota? How do we handle adding a user to a group? How do we handle changing owner on an item that uses quota?

    [ANSWER] We do not make any automatic changes that require batch updates. If a user is removed from or added to the quota group, the disk usage is still counted against the original group. Changing the owner of the item will cause the disk usage to be taken over by the new owner.
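
The quota check described above could be sketched as follows. This is an illustrative Python model only: the `Quota` and `Account` classes and the two helper functions are hypothetical, an operation is assumed to be allowed while the cached usage is still below the limit, and separate primary and secondary quotas are left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Quota:
    total: Optional[int] = None  # bytes; None means unlimited
    per_type: Dict[str, int] = field(default_factory=dict)


@dataclass
class Account:
    """A user or a group, with cached disk usage per data type."""
    name: str
    quota: Quota
    usage: Dict[str, int] = field(default_factory=dict)

    def has_room(self, data_type):
        """True while neither the total nor the per-type limit is reached."""
        total = self.quota.total
        if total is not None and sum(self.usage.values()) >= total:
            return False
        limit = self.quota.per_type.get(data_type)
        return limit is None or self.usage.get(data_type, 0) < limit


def may_start(user, group, data_type):
    """Checked once, before the operation; both quotas must have room."""
    return user.has_room(data_type) and (group is None or group.has_room(data_type))


def record_usage(user, group, data_type, size):
    """Update the cached values afterwards; a negative size models a deletion."""
    for account in (user, group):
        if account is not None:
            account.usage[data_type] = account.usage.get(data_type, 0) + size


user = Account("johan", Quota(total=100))
group = Account("lab", Quota(total=150))
if may_start(user, group, "raw data"):
    record_usage(user, group, "raw data", 120)  # allowed to overshoot once
print(may_start(user, group, "raw data"))
```

Because usage is only updated when data is saved or deleted, the check itself never has to scan the disk, which is the point of the cache suggested in item 7.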