Class CharsToNameCanonicalizer
For optimal performance, usage pattern should be one where matches
should be very common (especially after "warm-up"), and as with most hash-based
maps/sets, that hash codes are uniformly distributed. Also, collisions
are slightly more expensive than with HashMap or HashSet, since hash codes
are not used in resolving collisions; that is, equals() comparison is
done with all symbols in same bucket index.
Finally, rehashing is also more expensive, as hash codes are not
stored; rehashing requires all entries' hash codes to be recalculated.
Reason for not storing hash codes is reduced memory usage, hoping
for better memory locality.
Usual usage pattern is to create a single "master" instance, and either use that instance in sequential fashion, or to create derived "child" instances, which after use, are asked to return possible symbol additions to master instance. In either case benefit is that symbol table gets initialized so that further uses are more efficient, as eventually all symbols needed will already be in symbol table. At that point no more Symbol String allocations are needed, nor changes to symbol table itself.
Note that while individual SymbolTable instances are NOT thread-safe (much like generic collection classes), concurrently used "child" instances can be freely used without synchronization. However, using master table concurrently with child instances can only be done if access to master instance is read-only (i.e. no modifications done).
-
Field Summary
FieldsModifier and TypeFieldDescriptionprotected org.codehaus.jackson.sym.CharsToNameCanonicalizer.Bucket[]
Overflow buckets; if primary doesn't match, lookup is done from here.protected final boolean
Whether any canonicalization should be attempted (whether using intern or not)protected boolean
Flag that indicates if any changes have been made to the data; used to both determine if bucket array needs to be copied when (first) change is made, and potentially if updated bucket list is to be resync'ed back to master instance.protected int
Mask used to get index from hash values; equal to_buckets.length - 1
, when _buckets.length is a power of two.protected final boolean
Whether canonical symbol Strings are to be intern()ed before added to the table or notprotected int
We need to keep track of the longest collision list; this is needed both to indicate problems with attacks and to allow flushing for other cases.protected CharsToNameCanonicalizer
Sharing of learnt symbols is done by optional linking of symbol table instances with their parents.protected int
Current size (number of entries); needed to know if and when rehash.protected int
Limit that indicates maximum size this instance can hold before it needs to be expanded and rehashed.protected String[]
Primary matching symbols; it's expected most match occur from here.protected static final int
Default initial table size.static final int
protected static final int
Let's not expand symbol tables past some maximum size; this should protected against OOMEs caused by large documents with uniquer (~= random) names. -
Method Summary
Modifier and TypeMethodDescriptionfinal int
_hashToIndex
(int rawHash) Helper method that takes in a "raw" hash value, shuffles it as necessary, and truncates to be used as the index.int
Method for checking number of primary hash buckets this symbol table uses.int
calcHash
(char[] buffer, int start, int len) Implementation of a hashing method for variable length Strings.int
int
Method mostly needed by unit tests; calculates number of entries that are in collision list.static CharsToNameCanonicalizer
Method called to create root canonicalizer for aJsonFactory
instance.protected static CharsToNameCanonicalizer
createRoot
(int hashSeed) findSymbol
(char[] buffer, int start, int len, int h) int
hashSeed()
makeChild
(boolean canonicalize, boolean intern) "Factory" method; will create a new child instance of this symbol table.int
Method mostly needed by unit tests; calculates length of the longest collision chain.boolean
void
release()
protected void
reportTooManyCollisions
(int maxLen) int
size()
-
Field Details
-
HASH_MULT
public static final int HASH_MULT- See Also:
-
DEFAULT_TABLE_SIZE
protected static final int DEFAULT_TABLE_SIZEDefault initial table size. Shouldn't be miniscule (as there's cost to both array realloc and rehashing), but let's keep it reasonably small nonetheless. For systems that properly reuse factories it doesn't matter either way; but when recreating factories often, initial overhead may dominate.- See Also:
-
MAX_TABLE_SIZE
protected static final int MAX_TABLE_SIZELet's not expand symbol tables past some maximum size; this should protected against OOMEs caused by large documents with uniquer (~= random) names.- See Also:
-
_parent
Sharing of learnt symbols is done by optional linking of symbol table instances with their parents. When parent linkage is defined, and child instance is released (call torelease
), parent's shared tables may be updated from the child instance. -
_intern
protected final boolean _internWhether canonical symbol Strings are to be intern()ed before added to the table or not -
_canonicalize
protected final boolean _canonicalizeWhether any canonicalization should be attempted (whether using intern or not) -
_symbols
Primary matching symbols; it's expected most match occur from here. -
_buckets
protected org.codehaus.jackson.sym.CharsToNameCanonicalizer.Bucket[] _bucketsOverflow buckets; if primary doesn't match, lookup is done from here.Note: Number of buckets is half of number of symbol entries, on assumption there's less need for buckets.
-
_size
protected int _sizeCurrent size (number of entries); needed to know if and when rehash. -
_sizeThreshold
protected int _sizeThresholdLimit that indicates maximum size this instance can hold before it needs to be expanded and rehashed. Calculated using fill factor passed in to constructor. -
_indexMask
protected int _indexMaskMask used to get index from hash values; equal to_buckets.length - 1
, when _buckets.length is a power of two. -
_longestCollisionList
protected int _longestCollisionListWe need to keep track of the longest collision list; this is needed both to indicate problems with attacks and to allow flushing for other cases.- Since:
- 1.9.9
-
_dirty
protected boolean _dirtyFlag that indicates if any changes have been made to the data; used to both determine if bucket array needs to be copied when (first) change is made, and potentially if updated bucket list is to be resync'ed back to master instance.
-
-
Method Details
-
createRoot
Method called to create root canonicalizer for aJsonFactory
instance. Root instance is never used directly; its main use is for storing and sharing underlying symbol arrays as needed. -
createRoot
-
makeChild
"Factory" method; will create a new child instance of this symbol table. It will be a copy-on-write instance, ie. it will only use read-only copy of parent's data, but when changes are needed, a copy will be created.Note: while this method is synchronized, it is generally not safe to both use makeChild/mergeChild, AND to use instance actively. Instead, a separate 'root' instance should be used on which only makeChild/mergeChild are called, but instance itself is not used as a symbol table.
-
release
public void release() -
size
public int size() -
bucketCount
public int bucketCount()Method for checking number of primary hash buckets this symbol table uses.- Since:
- 1.9.9
-
maybeDirty
public boolean maybeDirty() -
hashSeed
public int hashSeed() -
collisionCount
public int collisionCount()Method mostly needed by unit tests; calculates number of entries that are in collision list. Value can be at most (size()
- 1), but should usually be much lower, ideally 0.- Since:
- 1.9.9
-
maxCollisionLength
public int maxCollisionLength()Method mostly needed by unit tests; calculates length of the longest collision chain. This should typically be a low number, but may be up tosize()
- 1 in the pathological case- Since:
- 1.9.9
-
findSymbol
-
_hashToIndex
public final int _hashToIndex(int rawHash) Helper method that takes in a "raw" hash value, shuffles it as necessary, and truncates to be used as the index. -
calcHash
public int calcHash(char[] buffer, int start, int len) Implementation of a hashing method for variable length Strings. Most of the time intention is that this calculation is done by caller during parsing, not here; however, sometimes it needs to be done for parsed "String" too.- Parameters:
len
- Length of String; has to be at least 1 (caller guarantees this pre-condition)
-
calcHash
-
reportTooManyCollisions
protected void reportTooManyCollisions(int maxLen) - Since:
- 1.9.9
-