hstore.rb - This is a Hash Store
Copyright: (C) 2000, 2001, 2002, 2003 Seva Inc. and Seva Software - www.sevasoftware.com
License: Same as Ruby
Download: www.sevasoftware.org/hstore/hstore_0_91.tar.gz
Installation procedures:
tar -zvxf hstore_0_91.tar.gz
cd hstore_0_91
ruby install.rb
An HStore stores and retrieves
key/value pairs to and from disk similar to a Pstore. An HStore uses two files, an index file
containing a Hash of keys and the file position for each object and a data
file containing the values in the HStore. The index file is cached in a
Hash in memory for fast look-ups. The values are stored in a separate data
file and are individually retrieved from disk with [] or fetch() and
written back to disk with []= or update()/insert(). The following files are
used/created by HStore:
- filename.hs - this contains the actual data or the value portion of the
key/value pairs. Each value is Marshal.dumped to this file. Back up this
file regularly.
- filename.hsi - this contains the index or Hash of keys/file pointer to
filename.hs. This file also contains the free_space array. Back up this
file regularly.
- filename.hsi.bak - this is the previous version of filename.hsi. This is
created just before the first update is written to the HStore in case of a failure. Here are
some interesting notes about the backup copy:
- If the process dies in the middle of saving or committing your changes to
the HStore, the HStore could be corrupt because the
saving process did not complete. If this happens, the next time you attempt
to connect to the HStore you will get
an error telling you to restore the HStore using the backup copy. Typically,
this can be fixed by copying filename.hsi.bak to filename.hsi.
- If the process dies when not in the middle of saving the HStore, any updates to the HStore since the last commit() or open()
will be lost but the HStore will be
okay, though it could have some extra unused space left in the file. The
unused space can be eliminated by calling compact() or running hstore_compact.rb.
- filename.hs.old_hs - this is created when an older HStore is encountered; it contains the
original data. It is also created when you call compact(), to hold a temporary
copy of the HStore data file while it
is compacted.
- filename.hs.00? - where ? is a number. These are exports of the HStore, created by calling
export(nil). A new sequential number is used every time export(nil) is
called, allowing you to keep several backups of your HStore. An error is raised if you attempt
to open an HStore that was not
properly saved.
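The recovery step described above (copying filename.hsi.bak over filename.hsi) could be scripted roughly as follows. This is a minimal sketch, not part of HStore: the helper name restore_index_from_backup and the ".corrupt" suffix are hypothetical, and the only thing taken from the description above is the .hsi/.hsi.bak naming convention.

```ruby
require 'fileutils'

# Hypothetical helper, not part of HStore: copies <base>.hsi.bak over
# <base>.hsi and keeps the damaged index aside for later inspection.
def restore_index_from_backup(base)
  index  = "#{base}.hsi"
  backup = "#{index}.bak"
  raise ArgumentError, "no backup found: #{backup}" unless File.exist?(backup)

  # Move the possibly corrupt index out of the way instead of overwriting it.
  # The ".corrupt" suffix is an assumption made for this sketch.
  FileUtils.mv(index, "#{index}.corrupt") if File.exist?(index)
  FileUtils.cp(backup, index)
  index
end
```

Remember that restoring the backup index discards any changes made since that backup was written, so it is worth copying the data file somewhere safe first.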
The premise is that the index Hash (filename.hsi) is rewritten only when a
new value is inserted, a value is relocated because it no longer fits in
its current space, or a value is deleted. The values are
individually read and written using Marshal.load/dump to/from filename.hs.
This should reduce the memory needed by an HStore to the size of the key + pos of
the object. The I/O is faster because less data is read/written during a
transaction.
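The idea of an in-memory index of file positions plus individually marshaled values can be illustrated with a toy class. This is only a sketch under stated assumptions: TinyStore, its ".data"/".index" suffixes and the 4-byte length prefix are invented for illustration and are not HStore's actual file format. Unlike HStore, the toy always appends, so it rewrites its index on every write and never reuses free space.

```ruby
# Toy illustration only; the class name, file suffixes and on-disk layout
# are assumptions made for this sketch, not HStore's real format.
class TinyStore
  def initialize(base)
    @data_path  = "#{base}.data"
    @index_path = "#{base}.index"
    # The whole index (key => file position) lives in memory.
    @index = File.exist?(@index_path) ? Marshal.load(File.binread(@index_path)) : {}
    File.binwrite(@data_path, '') unless File.exist?(@data_path)
  end

  # Appends the marshaled value, prefixed with its 4-byte length, and
  # records where it starts. Only this one value is written to the data file.
  def []=(key, value)
    blob = Marshal.dump(value)
    File.open(@data_path, 'ab') do |f|
      @index[key] = f.size
      f.write([blob.bytesize].pack('N'), blob)
    end
    File.binwrite(@index_path, Marshal.dump(@index))
    value
  end

  # Reads exactly one value from disk; no other values are loaded.
  def [](key)
    pos = @index[key]
    return nil if pos.nil?

    File.open(@data_path, 'rb') do |f|
      f.seek(pos)
      size = f.read(4).unpack1('N')
      Marshal.load(f.read(size))
    end
  end
end
```

Memory use in this sketch grows with the number of keys rather than with the size of the values, which is the trade-off the paragraph above describes.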
The HStore is thread safe and
multi-process safe (only if File.flock is supported; it works on Linux and
Windows 2000). Many concurrent processes or threads can read from an HStore at the same time, but only one process
or thread can write to the HStore at
a time. If one process or thread changes the HStore, other readers will reload the
index hash at the beginning of their next transaction. Example:
require 'hstore'
hs = HStore.new(filename)
hs.open{|store| store[key] = value} # close() is automatically called when open is called with a block
or
hs = HStore.new(filename)
hs.open
hs[key] = value # automatically starts an implied transaction
hs.transaction{|store| store[key] = value} # loads data only if it has been changed by another process or thread, then yields the HStore
hs.transaction{|store| store.delete(key)}
hs.close
Starting a transaction automatically reloads an updated HStore, but only when needed.
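The many-readers/one-writer behaviour described above is what File#flock provides with shared and exclusive locks. The following is a minimal sketch of that pattern, not HStore's own locking code; the helper names with_shared_lock and with_exclusive_lock are hypothetical.

```ruby
# Hypothetical helpers illustrating File#flock; HStore's internal locking
# may be implemented differently.

# Many processes can hold a shared lock on the same file at once.
def with_shared_lock(path)
  File.open(path, 'rb') do |f|
    f.flock(File::LOCK_SH)
    yield f
  end # closing the file releases the lock
end

# An exclusive lock blocks until no other process holds any lock on the file.
def with_exclusive_lock(path)
  File.open(path, 'r+b') do |f|
    f.flock(File::LOCK_EX)
    yield f
  end
end
```

For example, a writer might call with_exclusive_lock(path) { |f| f.write(data) } while readers use with_shared_lock(path) { |f| f.read }. These locks are advisory, so they only protect against other code that also calls flock on the same file.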
The HStore is similar to a Pstore.
Here are the basic differences:
- Pstore reads and writes all key/value pairs with every transaction update.
- HStore reads/writes specific elements
one at a time.
- HStore is multi-thread safe and
multi-process safe.
There is a module called tst/convert_pstore_to_hstore.rb that will read the
content of a Pstore and create an HStore.
When to use an HStore/Pstore versus a
database server:
- When to use a database server such as ArunaDB or PostgreSQL:
- You have a lot of data, possibly more than 100,000 rows or records
(key/value pairs)
- When you have relationships between two or more sets of data
- When you want triggers and stored procedures to ensure data integrity
- When to use an HStore:
- When you have a lot of data and typically access a small portion of that
data at any one time
- The HStore should be faster when
accessing small portions of the data at any given time.
- When to use a Pstore:
- When you typically need access to most of the data every time you interact
with the data.
- The Pstore should be faster when accessing most of the data at any given
time.
Pros compared to a Pstore:
- The HStore stores all keys and a file
pointer in a Hash in memory. The Pstore stores all key/value pairs in
memory.
- Opening the HStore is faster in most
cases because only the key/file pointer pairs are loaded. The Pstore loads
all key/value pairs.
- The HStore is faster when updating a
few key/value pairs because only the changed values are individually written
to disk, whereas the Pstore writes all key/value pairs to disk.
- The HStore is faster when deleting a
few key/value pairs in most cases because only the key/file_pointer Hash is
written while the Pstore writes all key/value pairs.
- The HStore is multi-thread and
multi-process safe (only if File.flock is supported; it has been tested
on Linux, FreeBSD, and Windows).
Cons compared to a Pstore:
- The HStore uses two files, one for
data and another for the index Hash of keys where the Pstore uses only one
file.
- The HStore could be slower when
inserting or deleting a lot of key/value pairs because the Pstore
writes all key/value pairs once, whereas the HStore writes each key/value pair
individually.
- The Pstore is faster than an HStore
when updating most of the keys in the Pstore.
Create the HTML documentation (requires RDoc):
rdoc --title HStore --template kilmer hstore.rb hstore_backup.rb hstore_restore.rb hstore_compact.rb
Support
Testing
- See tst/tst_hstore.rb. This runs several tests including a multi-thread
test.
- See tst/tst_filelock.rb. This tests the file locking mechanism used by the
hstore. I ran that on several different computers all accessing the same
file.
Status
Todo
- I plan to add a DBI driver that supports the following basic operations:
- connect/open
- disconnect/close
- select * from hstore_name # single hstore or table only
- select field1, field2 from hstore_name where field2 =~ 'value'
History
- 10/15/2003 - version 0.91
- Eliminated the padding and improved the efficiency of the HStore
- Changed how free space is tracked for greater efficiency.
- The index Hash only stores the position rather than [pos, size] for greater
efficiency.
- All methods will automatically create an implied transaction if one has not
been started.
- Opening an old style HStore is now
allowed and is automatically upgraded.
- The size of each buffer is now stored in the data file just before the
position of the buffer; this is needed to track free space.
- Changes to the interface:
- converted the parameters in open to a Hash
- added a read_only parameter to transaction
- eliminated show_free_space()
- the return value of free_space() is now an Array of
[count of free blocks, total size of free blocks]
- added compact() to eliminate the unused space in an HStore
- 09/23/2003 - version 0.90
- Changed the locking mechanism a little to fix a failure with the Windows
version of Ruby on XP.
- Changed the internal HStore version
from a string to a number between 0 and 255 for a little better
performance. It is the first character in the data file followed by the
descriptor.
- The put_descriptor method now only replaces the current descriptor if the
new descriptor is smaller than the old one.
- 10/26/2002 - initial version, 0.80
- Added a descriptor to the beginning of the data file so you can store info
about the HStore such as a description
and version in the HStore.
- The hash is only loaded once during the first open() and is cached in
memory between calls to open() and close(). This should provide much
faster access to the HStore when the
HStore is not always open.
- Added multi-thread and multi-process support (only if File.flock() is
supported; works on Linux and Windows)
- 3/19/2002 - Started and Completed.