To start, a quick tour of the OpenEdge RDBMS storage hierarchy (a minimal structural sketch in C follows the list):
- an OpenEdge database comprises a collection of logical containers called storage areas
- each storage area comprises a collection of one or more physical containers (files in the file system) called extents
  - exceptions:
    - the control area (aka dbname.db), which has only one extent
    - after-image areas, which each have one extent
- each "storage object" is assigned to one and only one storage area
  - a storage object can be:
    - a table (non-partitioned)
    - an index
    - a BLOB or CLOB column
    - a table partition (in a multi-tenant or partitioned table)
    - each partition-specific b-tree of a partition-aligned index of a partitioned table, where it has been defined as a local index
- in a Type II storage area (i.e. a data-storage area where the area type (_area._area-type) is 6 and the cluster size (_area._cluster-size) is 8, 64, or 512), a storage object is a chain of one or more clusters of database blocks containing data for only that one storage object
- a cluster is a collection of logically contiguous database blocks
- all data-storage blocks (i.e. blocks in areas other than the primary recovery area (aka the before-image "file") and the after-image areas) are the same size: the database block size (_area._area-blocksize)
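As a mental model only, and not Progress's actual source, here is a minimal C sketch of these relationships; every type and field name is hypothetical:

```c
/* Hypothetical model of the OpenEdge storage hierarchy, for intuition only. */
#include <stdint.h>

typedef struct Extent {                /* one physical file in the file system */
    char           path[256];
    struct Extent *next;               /* an area's extents form a list */
} Extent;

typedef struct StorageArea {           /* logical container of storage objects */
    uint32_t area_number;
    uint32_t area_type;                /* cf. _area._area-type (6 = data area) */
    uint32_t cluster_size;             /* cf. _area._cluster-size: 8/64/512 for Type II */
    uint32_t block_size;               /* cf. _area._area-blocksize */
    Extent  *extents;                  /* one or more extents */
} StorageArea;

typedef enum {                         /* the kinds of storage objects listed above */
    OBJ_TABLE, OBJ_INDEX, OBJ_LOB, OBJ_PARTITION, OBJ_LOCAL_INDEX_BTREE
} ObjectKind;

typedef struct StorageObject {
    ObjectKind kind;
    uint32_t   area_number;            /* assigned to exactly one storage area */
} StorageObject;
```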
All physical I/O in the database is done at the block level. If a client wants to read a particular index key, or sequence value, or record, the entire block containing that logical entity must be read into the buffer pool, if it is not already there. (A record may be fragmented, i.e. it may be in two or more pieces in separate blocks; when it is accessed, all blocks containing its fragments must be read into the buffer pool so the entire record may be assembled, written to a record buffer, and passed to the client.) Similarly, when a block in the buffer pool has been modified and must be written to disk, the entire block is written back to disk.
All logical I/O is done via the buffer pool. Clients and servers do not read or modify data directly on disk. Read requests go to the storage engine, which fetches the necessary block(s) from disk and loads them into the buffer pool, at which point their contents may be accessed and returned to the requesting client. Similarly, writes are done by modifying the blocks in the buffer pool; those modified buffers are later written to disk by an asynchronous page writer (APW), a server, or a self-service client.
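A minimal C sketch of that read path may help; pool_lookup, pool_evict_lru, and disk_read_block are hypothetical stand-ins for mechanisms discussed below, not real OpenEdge APIs:

```c
/* Sketch of block-granularity I/O through a buffer pool. All names are
   hypothetical stand-ins; the real engine is far more involved. */
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 8192                    /* e.g. an 8 KB database block */

typedef struct Buffer {
    uint32_t area;                         /* storage area number */
    uint64_t dbkey;                        /* block address within that area */
    int      dirty;                        /* modified since last write? */
    uint8_t  data[BLOCK_SIZE];
} Buffer;

/* Stand-in implementations so the sketch compiles: */
static Buffer *pool_lookup(uint32_t a, uint64_t k) { (void)a; (void)k; return NULL; }
static Buffer *pool_evict_lru(void) { static Buffer b; return &b; }
static void disk_read_block(uint32_t a, uint64_t k, uint8_t *dst)
{ (void)a; (void)k; memset(dst, 0, BLOCK_SIZE); } /* stands in for a whole-block read */

/* Reads always go through the pool; a miss pulls in the entire block,
   never just the record or key within it. */
Buffer *fetch_block(uint32_t area, uint64_t dbkey)
{
    Buffer *b = pool_lookup(area, dbkey);
    if (b != NULL)
        return b;                          /* cache hit */
    b = pool_evict_lru();                  /* cache miss: make room if needed */
    disk_read_block(area, dbkey, b->data); /* whole-block physical read */
    b->area = area;
    b->dbkey = dbkey;
    b->dirty = 0;
    return b;
}
```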
As alluded to above, when a client wants to access a block, it is read into the buffer pool if necessary. Therefore the storage engine must have an efficient mechanism to determine whether a particular block is or is not currently in the buffer pool. (In this context, "the buffer pool" means either the primary or alternate buffer pool.) A block is uniquely identified by two pieces of information: the number of the storage area in which it resides, and its dbkey, which is a unique physical block address within that storage area.
One could imagine a naive lookup mechanism: scan the header of each block in the buffer pool until either the desired block is found (a "cache hit") or the end of the chain of blocks in the buffer pool is reached (a "cache miss"). But clearly this would not be efficient, as the cost of such a scan would grow linearly with the size of the buffer pool (the sum of -B and -B2). On top of that computational cost there would be a concurrency penalty: no other process could be allowed to modify the chain of blocks while the requesting process was scanning it, in order to ensure a consistent state during the read operation.
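Reusing the hypothetical Buffer struct from the sketch above, the naive mechanism would look something like this:

```c
/* Naive lookup: walk every buffer in the pool until hit or exhaustion.
   The cost is O(-B + -B2), and the whole set must be held consistent
   (no concurrent modification) for the duration of the scan. */
#include <stddef.h>

Buffer *naive_lookup(Buffer *buffers, size_t nbuffers,
                     uint32_t area, uint64_t dbkey)
{
    for (size_t i = 0; i < nbuffers; i++)
        if (buffers[i].area == area && buffers[i].dbkey == dbkey)
            return &buffers[i];            /* cache hit */
    return NULL;                           /* cache miss: scanned everything */
}
```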
Instead, a data structure called a hash table is used. In this scheme, each entity in a collection (in this case, database blocks) is categorized using metadata that is unique to it (in this case, the combination of storage area number and dbkey) and assigned a place (an "index") in the table. That unique value is passed through a hash function to determine the index value for the block. The entry in the hash table at that index consists of a linked list of block headers, each of which contains some metadata about a block in the buffer pool and its state (e.g. which queue it is on, if any, whether it has been modified, etc.).
When a server (or self-service client) wants to access a block, it already knows the block's area number (from its _storageobject record) and its dbkey. It can then compute the hash value for that block, allowing it to go directly to that offset in the buffer hash table. That hash table entry consists of a relatively short (more on that below) linked list of zero or more block headers that can then be scanned sequentially to find the desired block. If it is found, it is a cache hit and the data can be accessed from the buffer. If it is not, the block is not in the buffer pool and it must be read in by the storage engine (making room for it by evicting a buffer from the end of the LRU or LRU2 chain as appropriate, if necessary). Thus each buffer lookup in the buffer hash table (BHT) consists of a quick indexed lookup followed by a short, relatively fixed-length sequential scan, regardless of the size of the buffer pool.
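Here is a minimal C sketch of that lookup; the header layout and the hash function are illustrative assumptions, since the real ones are not documented:

```c
/* Hashed lookup: an indexed jump to one short chain, independent of
   pool size. Struct layout and hash function are assumptions. */
#include <stddef.h>
#include <stdint.h>

typedef struct BufHdr {
    uint32_t       area;
    uint64_t       dbkey;
    struct BufHdr *hash_next;      /* next header on the same hash chain */
    /* ... queue membership, dirty flag, and other state ... */
} BufHdr;

typedef struct {
    size_t   nentries;             /* roughly (-B + -B2) / 4; see below */
    BufHdr **chains;               /* one linked list of headers per entry */
} BufHashTable;

static size_t bht_index(const BufHashTable *t, uint32_t area, uint64_t dbkey)
{
    /* Any decent mixing function works; this one is arbitrary. */
    uint64_t h = ((uint64_t)area << 40) ^ (dbkey * 0x9E3779B97F4A7C15ULL);
    return (size_t)(h % t->nentries);
}

BufHdr *bht_lookup(const BufHashTable *t, uint32_t area, uint64_t dbkey)
{
    /* In the engine, the latch covering this chain is held during the scan. */
    for (BufHdr *h = t->chains[bht_index(t, area, dbkey)]; h; h = h->hash_next)
        if (h->area == area && h->dbkey == dbkey)
            return h;              /* cache hit */
    return NULL;                   /* cache miss: storage engine reads the block in */
}
```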
The reason why the sequential scan of a given hash chain is relatively short is that the size of the BHT is a function of the size of the buffer pool. The size of the BHT is determined by the value of the -hash primary broker startup parameter. (While this means you can set it manually, you shouldn't; let the broker do it.) I don't know the exact formula for determining the default value of -hash, but it is something like the next prime number larger than (-B + -B2) / 4. In other words, the BHT has roughly a quarter as many entries as the buffer pool, regardless of the size of the buffer pool. Assuming a good hash function that gives a uniform distribution of hash values based on arbitrary inputs, this means that even when the buffer pool is full, each hash chain would have about four entries on average, making that sequential scan quick.
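Taking that guessed formula at face value, a quick back-of-the-envelope check:

```c
/* Back-of-the-envelope check of the assumed -hash sizing rule:
   next prime larger than (-B + -B2) / 4. The formula is a guess. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool is_prime(uint64_t n)
{
    if (n < 2) return false;
    for (uint64_t d = 2; d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

static uint64_t assumed_default_hash(uint64_t B, uint64_t B2)
{
    uint64_t n = (B + B2) / 4 + 1;
    while (!is_prime(n))
        n++;
    return n;
}

int main(void)
{
    uint64_t B = 6000000, B2 = 0;      /* example buffer pool size */
    uint64_t h = assumed_default_hash(B, B2);
    /* With a uniform hash and a full pool, the average chain length is
       (B + B2) / -hash, i.e. about 4 headers per chain. */
    printf("-hash ~ %llu, average chain length ~ %.2f\n",
           (unsigned long long)h, (double)(B + B2) / (double)h);
    return 0;
}
```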
While this algorithm is vastly more efficient than the naive mechanism described above, it still requires consistent reads. But concurrency is much better because only the single hash chain being scanned must be locked by the process reading it, not the entire table. The process does this by obtaining a lock on the BHT latch associated with that chain. In recent OpenEdge releases up to and including 11.7.2, there is a fixed number of 1024 BHT latches for the buffer hash table, so each latch protects about (-hash / 1024) hash chains.
But contention can still increase as the buffer pool grows, because the number of BHT latches does not grow with it, so the ratio of hash chains to latches increases. If two users try to find two different buffers at the same time, and those buffers hash to different but adjacent hash values protected by the same latch, both will try to lock it and (at least) one will have to wait.
An enhancement added in OE 11.7.3 addresses this potential problem by making the number of BHT latches configurable. It is now a percentage of -hash, and that percentage is determined by the value of a new primary broker startup parameter, -hashLatchFactor, which defaults to 10%. So imagine a database started with -B 6,000,000. In 11.7.2 or earlier, -hash will be roughly 1,500,000, giving over 1,400 hash chains per BHT latch. By contrast, the same database started with 11.7.3 or later would have about 150,000 BHT latches (roughly ten chains per latch), a dramatic increase. This could help with concurrency in random data access on a busy system that previously suffered from BHT latch contention.
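The arithmetic from that example as a small C program, using the approximate -hash value from above:

```c
/* Chains per latch, 11.7.2 (fixed 1024 latches) vs. 11.7.3
   (-hashLatchFactor, default 10% of -hash). Values approximate. */
#include <stdio.h>

int main(void)
{
    double hash         = 1500000.0;   /* approx. -hash for -B 6,000,000 */
    double latches_1172 = 1024.0;      /* fixed through 11.7.2 */
    double latches_1173 = hash * 0.10; /* default -hashLatchFactor of 10% */

    printf("11.7.2: %6.0f latches, ~%4.0f chains per latch\n",
           latches_1172, hash / latches_1172);  /* ~1465 chains per latch */
    printf("11.7.3: %6.0f latches, ~%4.0f chains per latch\n",
           latches_1173, hash / latches_1173);  /* ~10 chains per latch */
    return 0;
}
```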
Further reading:
- OpenEdge 12.0 Database Performance and Server-Side Joins
  http://pugchallenge.org/downloads2018/Banville_Performance.pdf
- KB: Is there a parameter to reduce the BHT latch contention?
  https://knowledgebase.progress.com/articles/Article/000040267/p
- KB: 11.7.3 startup parameter -hashLatchFactor
  https://knowledgebase.progress.com/articles/Article/1173-startup-parameter-hashLatchFactor/p