digestdb Documentation

Release 16.08.01
Chris Laws
August 21, 2016
Contents
1 User Guide
2 API
3 Developers Guide
DigestDB provides database-style (e.g. SQL) access to binary data files stored in a balanced set of file system directories.
DigestDB aims to provide an efficient strategy for storing and serving large numbers of binary files while maintaining a high level
of performance.
DigestDB was developed specifically for scenarios that require storing and recalling large numbers of large (~100 KB to ~40 MB) binary blobs.
A pure database solution did not seem to be the right choice for storing lots of binary data. The file system works
just fine for storing and accessing data files. DigestDB blends the two approaches to provide database-style access to
binary files stored on the local file system.
Internally, DigestDB comprises two parts:
• a SQLite database that stores blob categories and the SHA-256 hashes of binary blobs.
• a filesystem directory structure for storing the binary blobs in filenames that match the hash digest of the blob.
DigestDB is written using Python 3.5 and is licensed under the MPL license.
CHAPTER 1
User Guide
This section of the documentation provides some background information about DigestDB as well as step-by-step
instructions for getting the most out of DigestDB.
1.1 Installation Guide
This part of the documentation covers how to install digestdb.
The digestdb project requires Python 3.5+ and has some third-party dependencies.
1.1.1 Pip
The simplest way to install digestdb is using Pip:
$ pip install digestdb
This will install digestdb and all of its dependencies.
1.1.2 Get the code
You can clone the repository:
$ git clone https://github.com/claws/digestdb.git
$ cd digestdb
$ pip install -e .
1.2 Quickstart
1.2.1 Database
To start using DigestDB you first need to create a database. Let’s create one and tell it to use the
current directory for storing any binary content.
from digestdb import DigestDB
db = DigestDB('.')
By default DigestDB will create a file called digestdb.db and a directory called digestdb.data. The
digestdb.db file is a simple SQLite database that stores the categories and digests of the blobs. Categories are used to
group binary content to facilitate searching (e.g. JavaScript, CSS, images, etc). The digestdb.data directory is the
top level directory in which all the binary blobs are stored.
When DigestDB is instantiated it checks for a lock file. The lock file ensures that it has exclusive access to the
data; otherwise there is a risk of losing synchronisation between the files on disk and those listed in the database. If
DigestDB encounters a lock file when starting up it will report the error and shut down.
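The lock-file check described above can be illustrated with a minimal sketch. This is not DigestDB's own implementation: the names acquire_lock, release_lock and LockError are hypothetical, and the sketch only assumes that exclusive access is signalled by the existence of a file.

```python
import os


class LockError(RuntimeError):
    """Raised when a lock file already exists (hypothetical name)."""


def acquire_lock(lock_path: str) -> None:
    """Create the lock file, failing if it is already present."""
    try:
        # O_CREAT | O_EXCL makes creation atomic: it fails if the file exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise LockError("lock file already exists: " + lock_path)
    with os.fdopen(fd, "w") as lock_file:
        # Record the owning process id to help with manual diagnosis.
        lock_file.write(str(os.getpid()))


def release_lock(lock_path: str) -> None:
    """Remove the lock file so another process can take over."""
    os.remove(lock_path)
```

Creating the file with an exclusive flag avoids the race between checking for the file and creating it.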
Before writing or reading data from the DigestDB it must first be opened.
db.open()
Conversely, when you are finished with the database it must be closed.
db.close()
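Because every open must be paired with a close, it can be convenient to wrap the two calls in a context manager. The helper below is a sketch and is not part of digestdb; the name opened is hypothetical and it assumes only the open() and close() methods shown above.

```python
from contextlib import contextmanager


@contextmanager
def opened(db):
    """Open db, hand it to the caller, and always close it afterwards."""
    db.open()
    try:
        yield db
    finally:
        # Runs even if the body of the with-block raises.
        db.close()
```

With this helper, code such as "with opened(DigestDB('.')) as db:" would close the database even when an error occurs inside the block.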
If you re-open the database it will simply continue on from where it left off.
If you want to create a new database you can explicitly specify filename and data_dir.
DigestDB takes a number of optional arguments. The dir_depth is one of the most important settings and is
discussed in detail in the following section.
Database Depth
To understand the database directory depth we need some background first.
There isn’t any real limit on the number of files that can be stored in a directory. However, operations can become very slow as
the number of files increases. The time it takes to list a directory and check for the existence of a file grows with the number of
files. So we need a strategy to balance the files over some number of directories to avoid this problem.
The file path that determines where a blob is stored will be created from the blob’s hash. As a new item is added to
the database a hash (SHA-256 by default) is calculated. The default DigestDB dir_depth is 3. This means that
the first three bytes from the hash digest are used to construct the directory structure.
Given the following hash:
8fdd8b7dfa0d7d4f761da78e76d62ec4bee3b1847a6ad48507090e13752b2d
A dir_depth of 1 would result in the data item being stored in the following location:
8f/8fdd8b7dfa0d7d4f761da78e76d62ec4bee3b1847a6ad48507090e13752b2d
A dir_depth of 3 would result in the data item being stored in the following location:
8f/dd/8b/8fdd8b7dfa0d7d4f761da78e76d62ec4bee3b1847a6ad48507090e13752b2d
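The mapping from digest to path shown above can be sketched in a few lines. This is an illustration under the stated rules (one directory level per digest byte, file named after the full digest); blob_path is a hypothetical name and not a digestdb function.

```python
import hashlib


def blob_path(data: bytes, dir_depth: int = 3) -> str:
    """Return a relative path for data, derived from its SHA-256 digest."""
    digest = hashlib.sha256(data).hexdigest()
    # Each directory level is one byte of the digest, i.e. two hex characters.
    levels = [digest[2 * i:2 * i + 2] for i in range(dir_depth)]
    # The file itself is named after the complete digest.
    return "/".join(levels + [digest])


print(blob_path(b"example blob", dir_depth=1))
print(blob_path(b"example blob", dir_depth=3))
```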
Each directory level has 256 possible directories (00, 01, ... fe, ff). So with a directory depth of 1 we get 256 directories.
With a depth of 2 we get 256 * 256 = 65,536 and with a depth of 3 we get 256 * 256 * 256 = 16,777,216 directories.
The chosen directory depth can significantly impact cleanup operations. Let’s assume a naive internal implementation
that creates all directories up front. With no data files stored at all and depth=1 it takes about 0.03 seconds to remove the 256 directories.
When depth=2 it takes about 10 seconds to remove the 65 thousand directories. When depth=3 it takes a very
long time (2441 secs) to remove the 16 million directories.
For this reason, directories are created only when required. This significantly reduces the time it takes to remove
transient databases, such as those used in unit tests.
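Creating directories only when required can be sketched as follows. This is an assumption about how such a strategy might look, not DigestDB's actual code; store_blob and its parameters are hypothetical names.

```python
import hashlib
import os


def store_blob(root: str, data: bytes, dir_depth: int = 3) -> str:
    """Write data under root, creating only the directories it needs."""
    digest = hashlib.sha256(data).hexdigest()
    levels = [digest[2 * i:2 * i + 2] for i in range(dir_depth)]
    directory = os.path.join(root, *levels)
    # Create missing directories on demand; do nothing if they already exist.
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, digest)
    with open(path, "wb") as blob_file:
        blob_file.write(data)
    return path
```

With this approach an empty database contains no balancing directories at all, so removing it is quick regardless of the configured depth.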
The number of directories used to balance the data is related to the total number of data items that are expected to be
stored in the database. By default the depth is 3. This is suitable for storing lots (billions) of data files.
As an example, let’s say we plan on having around 10 million files in the database. The following table shows the
expected files in each directory for different directory depth settings.
depth   directories   files per dir
0       1             10,000,000.0
1       256           39,062.5
2       65,536        152.6
3       16,777,216    0.6
In this example a depth of 2 would be appropriate.
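The arithmetic behind such a table is a single division, and it can be used to pick a depth. The sketch below is illustrative only; the function names are hypothetical, and the target of at most 1,000 files per directory is an assumption rather than a digestdb recommendation.

```python
def files_per_directory(total_files: int, depth: int) -> float:
    """Average files per leaf directory when spread over 256**depth directories."""
    return total_files / 256 ** depth


def smallest_depth(total_files: int, max_per_dir: int, max_depth: int = 3) -> int:
    """Return the smallest depth that keeps the average at or below max_per_dir."""
    for depth in range(max_depth + 1):
        if files_per_directory(total_files, depth) <= max_per_dir:
            return depth
    return max_depth


for depth in range(4):
    print(depth, 256 ** depth, round(files_per_directory(10000000, depth), 1))
```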
The maximum number of entries in a database table whose primary key is a signed 32-bit integer is 2,147,483,647. So let’s
bump the expected number of files up to 2 billion.
depth   directories   files per dir
0       1             2,000,000,000.0
1       256           7,812,500.0
2       65,536        30,517.6
3       16,777,216    119.2
In this example a depth of 3 seems more appropriate.
1.2.2 Categories
Categories provide a method to group associated data items in the database. This provides a mechanism for more
efficient querying of data by category.
The selection of what constitutes a category depends on the scenario. Below are some examples of how categories
might be used to group different kinds of data:
• when storing inter-process messages (e.g. for later analysis or replay) the categories might be the message kinds
or identifiers.
• when storing web requests the categories might be route paths.
• when storing web server resources the categories might represent images, CSS, JavaScript, etc.
Categories must be added to the database before data items can be associated with the category.
db.put_category(
label='js', description='JavaScript resources')
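To make the relationship between categories and digests concrete, here is a small SQLite sketch. The table and column names are hypothetical and are not taken from digestdb's schema; it only shows how a category row could be required before blob digests reference it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per category, one row per stored blob digest.
conn.executescript("""
    CREATE TABLE categories (
        id INTEGER PRIMARY KEY,
        label TEXT UNIQUE NOT NULL,
        description TEXT
    );
    CREATE TABLE blobs (
        id INTEGER PRIMARY KEY,
        digest TEXT UNIQUE NOT NULL,
        category_id INTEGER NOT NULL REFERENCES categories(id)
    );
""")
conn.execute(
    "INSERT INTO categories (label, description) VALUES (?, ?)",
    ("js", "JavaScript resources"))
category_id = conn.execute(
    "SELECT id FROM categories WHERE label = ?", ("js",)).fetchone()[0]
# A placeholder 64-character digest stands in for a real SHA-256 value.
conn.execute(
    "INSERT INTO blobs (digest, category_id) VALUES (?, ?)",
    ("ab" * 32, category_id))
rows = conn.execute(
    "SELECT blobs.digest FROM blobs "
    "JOIN categories ON blobs.category_id = categories.id "
    "WHERE categories.label = ?", ("js",)).fetchall()
print(rows)
```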
1.2.3 Blobs
Binary data can be stored, retrieved and queried.
To add a binary blob to the database use put_data:
digest = db.put_data('js', b'\x00\x01...')
To add the contents of a file to the database use put_file:
digest = db.put_file('js', '/path/to/js/file')
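For files at the larger end of the expected size range, a digest can be computed without reading the whole file into memory. The sketch below uses only the standard library; file_digest is a hypothetical helper and is not claimed to be how put_file works internally.

```python
import hashlib


def file_digest(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as source:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: source.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```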
To check if data exists in the database use exists:
found = db.exists(digest)
To fetch data from the database use get_data:
data = db.get_data(digest)
To delete data from the database use delete_data:
data = db.delete_data(digest)
To query data from the database use query_data:
blobs = db.query_data(category='js')
CHAPTER 2
API
If you are looking for information on a specific function, class or method, this part of the documentation is for you.
2.1 digestdb package
2.1.1 Submodules
digestdb.database module
digestdb.hashify module
digestdb.model module
CHAPTER 3
Developers Guide
The project is hosted on GitHub and uses Travis for Continuous Integration.
If you have found a bug or have an idea for an enhancement that would improve the library, use the bug tracker.
3.1 Setup
The best way to work on digestdb is to create a virtual environment. This isolates your work from other projects’ dependencies
and ensures that any commands are pointing at the correct tools.
Note: In the following example python is assumed to be the Python 3.5 executable. You may need to explicitly
specify this (e.g. use python3) if you have multiple Python versions available on your system.
$ python -m venv myvenv
$ cd myvenv
$ source bin/activate
$ cd ..
To exit the virtual environment simply type deactivate.
Note: The following steps assume you are operating in a virtual environment.
3.2 Get the source
If you don’t care about contributing back and just want the source then use:
$ git clone git@github.com:claws/digestdb.git
If you want to contribute changes back to the project you should first create a fork of the digestdb repository and then
clone the repo to your file system. Replace USER with your GitHub user name.
$ git clone git@github.com:USER/digestdb.git
3.3 Install Dependencies
Install the development dependencies using pip.
$ cd digestdb
$ pip install -r requirements.dev.txt
3.4 Install digestdb
Use pip to perform a development install of digestdb. This installs the package in a way that allows you to edit the
code after it’s installed so that any changes take effect immediately.
$ pip install -e .
3.5 Test
The easiest method to run all of the unit tests is to run the make test rule from the top level directory. This runs the
standard library unittest tool which discovers all the unit tests and runs them.
$ make test
Or, you can call the standard library unittest module directly.
$ python -m unittest discover -s tests -v
Individual unit tests can be run using the standard library unittest package too.
$ cd digestdb/tests
$ python -m unittest test_database_db_dir
3.6 Type Annotations
The code base has been updated with type annotations. These provide gradual typing information that
makes the code easier to understand and helps with any future enhancements.
The type annotations checker mypy currently runs cleanly with no warnings.
Use the Makefile convenience rule to check no issues are reported.
$ make check_types
3.7 Documentation
To rebuild the project documentation, developers should run the make docs rule from the top level directory. It
performs a number of steps to create a new set of Sphinx HTML content.
$ make docs
To quickly view the rendered docs locally as you are working you can use the simple Python web server.
$ cd docs
$ python -m http.server
Then open a browser to the docs content.
3.8 Version
digestdb uses a three-segment CalVer versioning scheme comprising a short year, a zero-padded month and then
a micro version. The YY.MM part of the version is treated similarly to a SemVer major version. When backwards
incompatible or major functional changes occur the YY.MM will be rolled up. For all other minor changes only the
micro part will be incremented.
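The scheme can be illustrated with a short sketch that builds and compares version labels. The helper names are hypothetical, and padding the micro part to two digits is an assumption based on the 16.08.01 release label.

```python
import datetime


def format_version(day: datetime.date, micro: int) -> str:
    """Build a YY.MM.MICRO label from a date and a micro version."""
    return "{:02d}.{:02d}.{:02d}".format(day.year % 100, day.month, micro)


def parse_version(label: str) -> tuple:
    """Split a label into integers so versions compare numerically."""
    year, month, micro = (int(part) for part in label.split("."))
    return (year, month, micro)


print(format_version(datetime.date(2016, 8, 21), 1))
```

Comparing the parsed tuples orders releases first by YY.MM and then by the micro part.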
3.9 Release Process
The following steps are used to make a new software release:
• Update the version label in __init__.py. It must comply with the Version scheme.
• Create the distribution
make dist
• Test the distribution in the dist/ directory. This involves creating a virtual environment, installing the source distribution in it and running the tests. These steps have been captured for convenience in the dist/test.bash
helper script. The script takes the distribution archive as its only argument.
cd dist
./test.bash digestdb-16.08.01.tar.gz
cd ..
• Build the docs and check for any errors.
make docs
• Upload to PyPI using
make dist.upload
or manually using:
python setup.py sdist upload
• Create and push a repo tag to GitHub.
git tag YY.MM.MICRO -m "A meaningful release tag comment"
git tag # check release tag is in list
git push --tags origin master
– GitHub will create a release tarball at:
https://github.com/{username}/{repo}/tarball/{tag}.tar.gz