Interview Key: Java Collections

Iterable defines the contract and foreach statement. Iterable also provides a uniform way of accessing collection elements sequentially.

Collection contains the core functionality required of any collection other than a map.
Set is a collection, without duplicates. Sorted Set automatically sorts its elements and returns them in order.
NavigableSet extends Sorted Set by adding methods to find the closest matches to a target element.
Queue is a collection designed to accept elements at its tail for processing, yielding them up at its head in the order in which they are to be processed.
Deque extends Queue by allowing elements to be added or removed at both head and tail.
BlockingQueue and BlockingDeque are subinterfaces of Queue and Deque, and support concurrent access and block threads indefinitely or till timeout, until the requested operation can be carried out.
List is a collection in which order is significant, accommodating duplicate elements.
Map is a collection which uses key-value associations to store and retrieve elements. It is extended by ConcurrentMap, which provides support for concurrent access, by SortedMap, which guarantees to return its values in ascending key order, by Navigable-Map which extends SortedMap to find the closest matches to a target element, and by ConcurrentNavigableMap which extends ConcurrentMap and NavigableMap.

The Collections class provides wrapper objects that modify the behavior of standard collections classes in one of three ways—by synchronizing them, by making them unmodifiable, or by checking the type of elements being added to them.

Iterators of collections throw ConcurrentModificationException whenever they detect that the collection from which they were derived has been structurally changed i.e. elements have been added or removed. Iterators are implemented as a view of their underlying collection so, if that collection is structurally changed, the iterator may well not be able to continue operating correctly when it reaches the changed part of the collection. The policy of iterators for collections is fail fast.
Note: Using the iterator to iterate the collection and trying to remove the element using the collection's remove() method still throws ConcurrentModificationException. The remove() method of the iterator itself should only be used to avoid any exceptions.

Arrays: Arrays provide random-access memory with fast element access by position but slow insertion or removal of elements. They are used as the backing structure for ArrayList, CopyOnWriteArrayList, EnumSet and EnumMap, and for many of the Queue and Deque implementations.

Linked lists: It consist of chains of linked cells, cell containing data and reference to next cell in the list. Accessing elements by position is slow, but insertion and removal operations can be performed in constant time.

Hash tables: It stores elements indexed on their content rather than on an integer-valued index. Hash tables are the backing structure
for many Set and Map implementations, including HashSet and LinkedHashSet together with the corresponding maps HashMap and LinkedHashMap.

Trees: It organizes the elements by content and can store or retrieve them in sorted order. Trees are the backing structures for TreeSet and TreeMap.

Adding an element to the end of an ArrayList can normally be done in constant time, unless the ArrayList has reached its capacity. In such case, a new and larger array must be allocated, and the contents of the old array transferred into it. The cost of this operation is linear in the number of elements in the array, but its occurrence is relatively rare. The amortized cost is the total cost of performing the operation n times divided by n, taken to the limit as n becomes arbitrarily large. Hence amortized cost of adding an element to an ArrayList is, the total cost for N elements is O(N), so the amortized cost is O(1).

Some Collection API methods return collections with restricted functionality. The set of keys obtained from a Map can have elements removed but not added. Others e.g. the list view returned by Arrays.asList can have elements neither added nor removed or may be completely read-only.

Thread Safety: Legacy collections can be made thread safe by synchronizing the critical section of code. Before a thread can execute synchronized code, it has to get the lock on the current object. Any other thread trying to enter critical section will be blocked on the lock and suspended until the lock is released. The concurrent collections achieve thread safety by using copy-on write, compare-and-swap and concurrent locks.

Copy-on-Write: Classes that use copy-on-write store their values in an internal array, which is effectively immutable; any change to the value of the collection results in a new array being created to represent the new values. Copy-on-write is used by the collection classes CopyOnWriteArrayList and CopyOnWriteArraySet.

Compare-and-Swap (CAS): Consider a computation were the value of a single variable is used as input to a long-running calculation whose eventual result is used to update the variable. In CAS, it makes a local copy of the variable and performs the calculation without getting exclusive access. Only when it is ready to update the variable does it call CAS, which in one atomic operation compares the variable’s value with its value at the start and, if they are the same, updates it with the new value. If they are not the same, the variable must have been modified by another thread. Collections using CAS include ConcurrentLinkedQueue and ConcurrentSkipListMap.

Concurrent Locks: A thread can acquire locks under special conditions, only if the lock is not currently held, or with a timeout, or if the thread is not interrupted. Unlike synchronized code, in which an object lock is held while a code block or a method is executed, a Lock is held until its
unlock method is called. Hence collection can be divided into parts so that they can be separately locked, giving improved concurrency. LinkedBlockingQueue and ConcurrentHashMap are collections using locks.

Set: A set is a collection of items that cannot contain duplicates; adding an item if it is already present in the set has no effect.

HashSet: It is implemented by a hash table, an array in which elements are stored at a position derived from their contents. An element’s position in a hash table is calculated by a hash function of its contents. Hash functions are designed to give, as far as possible, an even spread of results (hash codes) over the element values that might be stored. Collections Framework classes use bit masking to obtain an index from the hash code.
When two distinct values hash to the same location in the hash table, collisions occur in hashing. When collisions do occur, we have to have a way of keeping the colliding elements at the same table location or bucket, often done by storing them in a linked list.

HashSet is actually implemented by a specialized HashMap, each of the cells in the chain contains not one but two references, to a key and a value. When a hash table is being used to represent a set, all values are the same, only the presence of the key is significant.

If a collision does take place, a linked list has to be created and subsequently traversed, adding an extra cost to insertion proportional to the number of elements in the list. The table size is increased by rehashing—copying to a new and larger table—when the load reaches a specified threshold (its load factor) to avoid worsening of performance.

Iterating over a hash table requires each bucket to be examined to see whether it is occupied and therefore costs a time proportional to the capacity of the hash table plus the number of elements it contains. Since the iterator examines each bucket in turn, the order in which elements are returned depends on their hash codes, so there is no guarantee as to the order in which the elements will be returned.

The chief attraction of a hash table implementation for sets is the (ideally) constanttime performance for the basic operations of add, remove, contains, and size. Its main disadvantage is its iteration performance; since iterating through the table involves examining every bucket, its cost is proportional to the table size regardless of the size of the set it contains.
HashSet is unsychronized and not thread-safe; its iterators are fail-fast.

LinkedHashSet: It uses a doubly linked list internally and a hash table. LinkedHashSet inherits from HashSet and guarantees that its iterators will return their elements in the order in which they were first added. It does this by maintaining a linked list of the set elements.

The linked structure has an improved performance for iteration, were the next performs in constant time, as the linked list can be used to visit each element in turn. It is in contrast to HashSet were every bucket in the hash table must be visited whether it is occupied or not, but adds the overhead of maintaining the linked list. LinkedHashSet is chosen in preference to HashSet only if the order or the efficiency of iteration is more important. LinkedHashSet is unsychronized and not threadsafe and its iterators are fail-fast.

CopyOnWriteArraySet: It is implemented as a thin wrapper around an instance of CopyOnWriteArrayList, which in turn is backed by an array. The array is treated as immutable; any change to the contents of the set results in an entirely new array being created. Hence add or contains operations has complexity O(n) as its implemented by a linear search. CopyOnWriteArraySet should not be used were many searches or insertions are expected. The backing array implementation means that iteration costs O(1) per element faster than HashSet, and it provides thread safety without adding any cost for read operations. Commonly used in Subject-Observer design pattern requiring events to be notified to a set of observers. This set must not be modified during the process of notification; which is ensured by CopyOnWriteArraySet with the overhead carried entirely by write operations. Iterators are only allowed to read the set and remove method is not implemented, thus called snapshot iterators.
CopyOnWriteArraySet should only be used where set size will remain relatively small, read operations greatly outnumber writes and thread safety is required.

EnumSet: Implementation were the number of possible elements is fixed and a unique index can be assigned to each elements. The number of keys is fixed by the constants of the enumerated type, and the ordinal method returns values that are guaranteed to be unique to each constant. The add, remove, and contains methods are implemented as bit manipulations, with constant-time performance. Bit manipulation on a single word is extremely fast, and a long value can be used to represent EnumSets over enum types with up to 64 values. EnumSet is an abstract class and it hides the concrete implementation by exposing factory methods that call the constructor for the appropriate subclass.

EnumSet obeys the contract for Set, with the added specification that its iterators will return their elements in the order in which their enum constants are declared. It is not thread-safe and its iterators are not fail-fast. They may be either snapshot or weakly consistent; but the contract guarantees only that they will be weakly consistent.

SortedSet and NavigableSet: It is a subinterface of Set which adds to the Set contract a guarantee that its iterator will traverse the set in ascending element order. TreeSet is the only implementation of SortedSet. Merging two sorted lists of size n is O(n), but adding n elements to a TreeSet of size n is O(n log n). The SortedSet uses the compare method of its Comparator or, if its not present then the compareTo method of its elements is used instead of the elements equals method to determine when elements are distinct.
NavigableSet interface extends SortedSet and adds methods to find the closest matches to a target element. The NavigableSet methods (especially for getting range views, e.g subSet, tailSet etc) allow to specify for each bound whether it should be inclusive or exclusive. NavigableSet returns null values to signify the absence of elements.

TreeSet: Trees are the data structure which provides fast insertion and retrieval of individual elements but which also requires that they be held in sorted order. Trees perform better than list but are slower than hash tables. Their advantage over hash tables being the capability to maintain the ordering or sortedness. The cost of retrieving or inserting an element is proportional to the depth of the tree, since we locate the word by alphabetic comparison at each level starting from the root. A binary tree with n complete levels will have 2n–1 elements. Hence the depth of a tree with n elements will be bounded by log n (since 2^log n = n).

TreeSet uses a data type called a red-black tree, which has the advantage that if it becomes unbalanced through insertion or removal of an element, it can always be rebalanced in O(log n) time. TreeSet is unsychronized and not thread-safe; its iterators are fail-fast.

ConcurrentSkipListSet: It is backed by a skip list, a modern alternative to the binary trees. A skip list for a set is a series of linked lists, each of which is a chain of cells consisting of two fields: one to hold a value, and one to hold a reference to the next cell. Elements are inserted into and removed from a linked list in constant time by pointer rearrangement.

The first linked list of the collection (level 0 in the figure) contains the elements of the set, sorted according to their natural order or by the comparator of the set. Each list above level 0 contains a subset of the list below, chosen randomly according to a fixed probability. For this example, if the probability is 0.5 on average then each list will contain half the elements of the list below it. Navigating between links takes a fixed time, so the quickest way to find an element is to start at the beginning of the top list and to go as far as possible on each list before dropping to the one below it, when the next element is higher than the search value. Elements are removed from skip list by removing it from each level in which it occurs.

The probability is very high that skip lists will give performance comparable to binary trees: search, insertion and removal all take O(log n). Their compelling advantage for concurrent use is that they have efficient lock-free insertion and deletion algorithms, whereas there are none known for binary trees. The iterators of ConcurrentSkipListSet are weakly consistent.

The HashSet, LinkedHashSet and TreeSet are not thread safe, hence should be wrapped in Collection.synchronizedSet for multi-threading. Both HashSet and LinkedHashSet have no sorting. If require frequently iterating over the set or access ordering LinkedHashSet is the choice. If the set should sort its elements then TreeSet and ConcurrentSkipListSet should be use.

Queue: It is a collection to process elements by first in first out basis. Below are the methods included in Queue interface.

boolean add(E e);        // throws exception if currently out of space
boolean offer(E e);        // throws exception if unable to insert (preferable)
E remove();                // removes and returns head, exception if queue is empty
E poll();                // removes and returns head, 'null' if queue is empty
E element();            // returns head, exception if queue is empty
E peek();                // returns head, 'null' if queue is empty

In general, the use of null as a queue element is discouraged by the Queue interface as peek() and poll() methods return null for empty queue, and LinkedList is only standard implementation that allows it. Queue implementation varies mainly with the ordering of retriving the elements.

ArrayDeque: It is used get FIFO ordering and is unbounded.
PriorityQueue: It processes its elements according to the ordering either natural if implement Comparable or custom if implemented by Comparator. It is not thread safe, nor provides blocking behavior. Priority Queue gives no guarantee of how it presents multiple elements with the same value. Priority queues are implemented by priority heaps which is a binary tree similar to TreeSet with two differences.

First constraint is the only ordering constraint is that each node in the tree should be larger than either of its children, and second, that the tree should be complete at every level except possibly the lowest; if lowest level is incomplete, the nodes it contains must be grouped together at the left.
The new element is added to the leftmost vacant position and then it is repeatedly exchanged with its parent until it reaches a parent that has higher priority. When root element with highest priority is removed, it first places the rightmost element from the bottom row into the root position and reverses the add process by repeatedly exchanging with the larger of its children until it has highest priority. Addition and Removal require minimum operations proportional to height of the tree, hence add, remove takes O(log n) time, while contains takes O(n) as it traverses entire tree.

ConcurrentLinkedQueue: It is an unbounded, thread-safe, FIFO-ordered queue. It uses a linked structure. It uses a CAS-based wait-free algorithm—that is, one that guarantees that any thread can always complete its current operation, regardless of the state of other threads accessing the queue. It executes queue insertion and removal operations in constant time, but requires linear time to execute size, as the algorithm relies on co-operation between threads for insertion and removal and does not keep track of the queue size.

BlockingQueue: Primarily used for producer-consumer queues, and ensuring that enqueue and dequeue do not return until they have executed successfully. BlockingQueue methods come in four forms as below, one throws an exception, the second returns a special value (either null or false, depending on the operation), the third blocks the current thread indefinitely until the operation can succeed, and the fourth blocks for only a given maximum time limit before giving up.

Queue Methods Summarization
	Throws exception	Special value	Blocks	Times out
Insert	add(e)	offer(e)	put(e)	offer(e, time, unit)
Remove	remove()	poll()	take()	poll(time, unit)
Examine	element()	peek()	not applicable	not applicable

BlockingQueue guarantees that the queue operations of its implementations will be threadsafe and atomic. But this guarantee doesn’t extend to the bulk operations inherited from Collection—addAll, containsAll, retainAll and removeAll. The five BlockingQueue implementations are LinkedBlockingQueue, ArrayBlockingQueue, PriorityBlockingQueue, DelayQueue and SynchronousQueue.

LinkedBlockingQueue: It is a thread-safe, FIFO-ordered queue, based on a linked node structure. Its choice for an unbounded blocking queue, and better than ArrayBlockingQueue for bounded use. Insertion and removal requires constant time while contains require linear time. Iterators are weakly consistent.

ArrayBlockingQueue: Implementation is based on a circular array. Only the elements at the front and rear could be removed at constant time using the head or the tail of the queue. Insertion and removal of elements in the middle of the queue has time complexity O(n) as all the elements from one end must be moved along to maintain a compact representation. Its constructors provide a collection parameter to be initialized with the contents of the specified collection with the same order. They also provide the alternative to provide a fair scheduling policy which guarantees that the queue will choose the one that has been waiting longest. Ordering is FIFO, insertion and removal requires constant time while contains require linear time. Iterators are weakly consistent.

PriorityBlockingQueue: It is a thread-safe, blocking version of PriorityQueue. Its iterators are failfast.

DelayQueue: Its a specialized priority queue, were ordering is based on the delay time for each element i.e. he time remaining before the element will be ready to be taken from the queue. If all elements have a positive delay time, then poll will return null. The element with the longest-expired delay time will be at the head of the queue. Most queue operations treat a queue with no unexpired elements as if it were empty except for peek and remove.

SynchronousQueue: This queue has no internal capacity. A thread that wants to add an element to a SynchronousQueue must wait until another thread is ready to simultaneously take it off, and same is true for the reverse case. It has a mechanism for synchronizing two threads, allowing them to cooperate by exchanging data. SynchronousQueue behaves like an empty Collection and returns an empty iterator.

Deque: A double-ended queue can insert or remove elements from either end, unlike a queue were insertion can only occur at the rear and removal at rear.
It is a linear structure in which elements added at the tail are yielded up in the same order at the head. It has methods, addFirst and addLast similar to push, removeFistOccurence similar to remove in Collection, offerFirst similar to offer in Queue.

ArrayDeque: It uses a circular array and similar to ArrayBlockingQueue. It can be used as deques and FIFO queues. Add and remove at head/tail take constant time and its iterators are failfast.

LinkdList: Its only implementation of Deque permitting null elements. It has linked list structure and extra field in each cell to point to the previous entry to enable backward traversal. Its advantage is constant insertion and removal time, and its iterators are fail fast.

BlockingDeque: Similar to BlockingQueque adding similar methods for operations at the other end of Deque. Concurrent deques are the basis of one of the best load balancing methods, work stealing (were an idle thread can process tail elements in parallel when head is processed by a thread).
It is implemented based on a doubly linked list structure similar to LinkedList. Queue insertion and removal take constant time and operations such as contains, which require traversal of the queue, require linear time. The iterators are weakly consistent.

Lists: A list is a collection which—unlike a set—can contain duplicates, and which—unlike a queue—gives the user full visibility and control over the ordering of its elements. It includes methods for positional access, search, range view. The subList() method works in a similar way to the subSet(), were it returns a view of part of the list from which it was obtained, so changes in it are reflected in the original list. If elements are modified in the backing list then any later attempts to use subList results ins ConcurrentModificationException.

ArrayList: ArrayList is a List backed by array and is resizable and extensible. When ArrayList grows to the point where its size is equal to its capacity, an attempt add another element leads to replace the backing array with larger one holding old contents and new element, which makes its amortized cost to O(1). The get and set operations takes constant time, while add and remove require adjusting other elements, and hence proportional to size of the list. ArrayDeque provides better insertion and removal performance than ArrayList, along with random access as its implemented using circular queue. The initial capacity of an ArrayList created by the default constructor is 10. The iterators of ArrayList are fail-fast.

LinkedList: LinkedList is backed by a doubly linked list and is bad for random access. Adding or Removing just requires link manipulations so is better than ArrayList in that sense. LinkedList takes constant time for adding and removing elements.

CopyOnWriteArrayList: It provides thread safety along with very fast read access. The cost is that the array which backs the collection has to be treated as immutable, so a new copy is created whenever any changes are made to the collection. This cost may not be too high to pay if changes in the set of observer objects occur only rarely. CopyOnWriteArraySet delegates all of its operations to an instance of CopyOnWriteArrayList. If frequent writes are required along with thread safety then a synchronized wrapper is used around ArrayList or LinkedList.

Map: It does not inherit from Collection interface. It defines the operations that are supported by a set of key-to-value associations in which the keys are unique. Map provides methods such as entrySet(), keySet() and values() to return collection views. The collections returned by these methods are backed by the map, so any changes to them are reflected in the map itself, and vice versa. Elements can be removed directly or via an iterator over the view, but cannot be added as will get an UnsupportedOperationException. Removing key removes the key-value association while removing values only removes association mapping it.

HashMap: HashMap has an fixed sized internal array of buckets, each identified by a unique number. HashMap stores both key and value object as Map.Entry in bucket. While adding a key-value pair to HashMap, the hashcode for the key is computed and the pair is stored in the bucket with the identifier equal to hash code of the key. At some point of time hash function will return same bucket location for two different keys known as a collision in HashMap. In such case, a linked list is formed at that bucket location and new entry is stored as next node. In order to retrieve object from this linked list an extra check to search correct value is required which is done by equals() method. Since each node contains an entry, HashMap keep comparing entry's key object with passed key using equals() and when it return true, Map returns corresponding value. The equals() and hashcode() methods are not used in case of null keys in HashMap. HashMap provides constant time performance for put and get. The iterators are fail-fast.

LinkedHashMap: LinkedHashMap refines the contract of its parent class, HashMap, by guaranteeing the order in which iterators return its elements.
In LinkedHashMap iteration; elements can be returned either in the order in which they were inserted in the map, or in the order in which they were accessed (from least to most recently accessed) using the accessOrder constructor flag. Access-ordered maps are especially useful for constructing Least Recently Used (LRU) caches. The removeEldestEntry (method) contract states that the methods put and putAll will call removeEldestEntry whenever a new entry is added to the map, providing the opportunity to remove the eldest entry each time a new one is added. In LinkedHashMap itself, removeEldestEntry does nothing and returns false. Iteration over a collection of keys or values returned by a LinkedHashMap is linear in the number of elements. The iterators over such collections are fail-fast.

WeakHashMap: An ordinary Map keeps ordinary (“strong”) references to all the objects it contains. But suppose that the objects of the key class are unique—that is, object equality is the same as object identity e.g. unique serial number. In this case, once we no longer have a reference—from outside the map—to an object being used as a key, we can never look it up again, because we can never re-create it. So the map might as well get rid of the key-value pair and conserve memory, which is the idea of WeakHashMap. Internally WeakHashMap holds references to its key objects through references of the class java.lang.ref.WeakReference. A WeakReference introduces an extra level of indirection in reaching an object. For example, weak reference to the string "code gui" is made as below:

WeakReference<String> wref = new WeakReference<String>("code gui");

And at a later time we would recover a strong reference to it using the get method of WeakReference, e.g. weakRef.get(). As weak reference does not prevent garbage collector from reclaiming the object, the recovered reference may or may not be null. Only strong references are followed to determine reachability, so the keys of a WeakHashMap will be available to be reclaimed if they are not reachable by any other route by garbage collector. A key cannot be reclaimed if it is strongly referenced from the corresponding value. Before most operations on a WeakHashMap are executed, the map checks which keys have been reclaimed, without just checking null keys as null is valid value for the key in WeakHashMap. The WeakReference mechanism allows us to tell the garbage collector to leave information in a ReferenceQueue each time it reclaims a weakly referenced object. The WeakHashMap then removes every entry of which the garbage collector has reclaimed the key.
Its drawbacks is that it weakly references the map’s keys rather than its values, which usually occupy much more memory, hence until stale entries are not removed from map the memory benefit is limited. The second drawback is that weak references are too weak; the garbage collector is liable to reclaim a weakly reachable object at any time, and the programmer cannot influence this in any way. The iterators are fail fast.

IdentityHashMap: In IdentityHashMap, two keys are considered equal only if they are physically the same object: identity, rather than equals, is used for key comparison. IdentityHashMap is a specialized class, commonly used in operations such as serialization. The algorithm which traverses the graph be able to check for each node, whether that node has already been seen; to detect graph cycles. Collisions are handled by linear probing, a technique in which the key and value references are stored directly in adjacent locations in the table itself rather than in cells referenced from it. With linear probing, collisions are handled by simply stepping along the table until the first free pair of locations is found.

EnumMap: Similar to EnumSet were in an array implementation, the ordinal value of each enumerated type constant can serve as the index of the corresponding value. Read and write operations takes constant time. It is not thread-safe and iterators are weakly consistent.
An EnumMap contains a reified key type used to check validity of new entries.

SortedMap: SortedMap adds to its contract a guarantee that its iterators will traverse the map in ascending key order. A SortedMap imposes an ordering on its keys, either that of their natural ordering or of a Comparator, were compare should be consistent with equals method.

NavigableMap: NavigableMap extends SortedMap. It returns result of type Map.Entry. Map.Entry objects returned by its methods are snapshot objects reflecting the state of
the map at the time they were produced, and do not support setValue.

TreeMap: SortedMap is implemented in the Collections Framework by TreeMap. The internal representation of a TreeSet is just a TreeMap in which every key is associated with the same standard value. Its basic operations (get, put, and remove) perform in O(log n) time). The collection view iterators are fail-fast.

ConcurrentMap: Collections.synchronizedMap does not provide high-throughput thread-safety because with full synchronization
each operation needs to obtain an exclusive lock on the entire map. Locking only a part of the collection at a time—lock striping—
can achieve very large gains in throughput. Clients need assistance from the collection itself to carry out atomic actions as there is no single lock for a client to hold to gain exclusive access.

ConcurrentHashMap: Its an implementation of ConcurrentMap providing throughput with thread safety. The contract states that the results of retrievals will reflect the latest update operations completed before the start of the retrieval. Updates also can often proceed without blocking, because a ConcurrentHashMap consists of not one but a set of tables, called segments, each of which can be independently locked. If the number of segments is large enough relative to the number of threads accessing the table, there will often be no more than one update in progress per segment at any time.
The constructor allows to provide the number of segments that the map will use to control its concurrency level. THe size() method obtains an exclusive lock to all the segments, obtaining the entry count from each, then unlocking them again. The collection views return weakly
consistent iterators.

ConcurrentNavigableMap: It inherits from both ConcurrentMap and NavigableMap, and contains just the methods of these two interfaces with a few changes to make the return types.

ConcurrentSkipListMap: It is similar to ConcurrentSkipListSet and implemented by a ConcurrentSkipListMap in which every key is associated with the same standard value. Basic operations (get, put, and remove) perform in O(log n) time); iterators over the collection views execute next in constant time.

Tuesday, April 21, 2015

Java Collections

No comments:

Post a Comment