Improving Seek Time for Column Store Using MMH Algorithm

Hash-based search has proven effective on large data warehouses stored in column stores. Data distribution has a significant impact on hash-based search. To reduce this impact, we propose the Memory Managed Hash (MMH) algorithm, which uses the shift-XOR group for queries and transactions in a column store. Our experiments show that MMH improves read and write throughput by 22% for the TPC-H distribution.


I. INTRODUCTION
Searching in a Column Store (CS) is greatly influenced by the address lookup process. Hashing algorithms have been widely adopted to provide fast address lookup [2,3,8]. Bob Jenkins' hashing algorithm processes the key twelve octets at a time; its post-processing step is slightly more complex because it must handle a partial final block [14] in CS. However, it is possible to improve the throughput rate for fast address lookup in CS.
For various data warehouse applications, address lookup plays a major role in performance. Existing hashing and lookup techniques are discussed in Section 2. Hash scan contributes to the performance of CS; Section 3 summarizes the hash scan for simple and complex queries. The proposed algorithm, named MMH, is an improved version of Jenkins' algorithm. Its informal and formal descriptions are given in Section 4. Section 5 presents a case study, with implementation details, that shows the effectiveness of MMH. Section 6 analyzes the results of MMH against Jenkins' algorithm. Finally, Section 7 concludes and outlines future work.
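The block-wise key processing mentioned above can be illustrated with a minimal sketch. It only shows how a key splits into full twelve-octet blocks plus a partial final block that needs separate handling; it does not reproduce Jenkins' mixing step. The function name `split_into_blocks` and the sample key are hypothetical and chosen for illustration.

```python
def split_into_blocks(key: bytes, block_size: int = 12):
    """Split a key into full blocks and a (possibly empty) partial tail.

    block_size=12 mirrors the twelve-octet step described in the text;
    the name and signature are illustrative assumptions.
    """
    full_len = len(key) - (len(key) % block_size)
    # Full blocks can all go through the same mixing routine.
    blocks = [key[i:i + block_size] for i in range(0, full_len, block_size)]
    # The tail is shorter than one block and needs its own final step.
    tail = key[full_len:]
    return blocks, tail


if __name__ == "__main__":
    sample_key = bytes(range(30))  # hypothetical 30-octet key
    blocks, tail = split_into_blocks(sample_key)
    print(len(blocks), "full blocks,", len(tail), "octets in the partial block")
```

A key whose length is an exact multiple of twelve yields an empty tail, which is the case where the extra post-processing work disappears.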

II. RELATED WORK
Hashing has been used most successfully to avoid block conflicts in interleaved parallel memory systems used in multiprocessors and vector processors. Linear skewing functions compute the block number using integer arithmetic [2,3]. Stride patterns are mapped conflict-free when the stride and the number of memory blocks are relatively prime [4].
To minimize the latency of computing the per-block address, fragmentation was introduced in the Burroughs Scientific Processor; however, it wastes 1/17th of the memory [5]. Fragmentation and complex block number computations are not necessary to obtain conflict-free access to stride patterns. It has been observed that certain XOR-based hash functions, based on the division of binary polynomials, can simultaneously map a large number of stride-based patterns conflict-free [6]. Work on XOR-based interleaving functions has mainly focused on constructing a conflict-free hash function for several patterns, and has done so successfully [15,8]. Bob Jenkins' hash produces uniformly distributed values for hash tables [14]. However, the literature indicates that there is scope to improve the seek time of Jenkins' algorithm for column stores.
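The relative-primality condition for stride patterns can be checked with a short sketch. It assumes plain modulo interleaving (block number = address mod number of blocks), which is only one of the mappings discussed above; the function names are hypothetical.

```python
from math import gcd


def blocks_touched(stride: int, num_blocks: int) -> int:
    """Count the distinct blocks hit by num_blocks consecutive stride accesses.

    Assumes modulo interleaving: block = address mod num_blocks.
    """
    return len({(i * stride) % num_blocks for i in range(num_blocks)})


def is_conflict_free(stride: int, num_blocks: int) -> bool:
    """A stride is conflict-free when every access lands in a different block."""
    return blocks_touched(stride, num_blocks) == num_blocks


if __name__ == "__main__":
    for stride in (1, 3, 4, 8):
        print(stride, gcd(stride, 16), is_conflict_free(stride, 16))
```

With 16 blocks, a stride of 3 reaches every block while a stride of 4 keeps returning to the same four. With a prime number of blocks such as 17 (used here purely as an illustration), every stride from 1 to 16 is conflict-free.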

III. COLUMN STORE HASH SCAN
This section describes the hash scan for both simple and complex queries in a column store.

A. Hash Scan for Simple Queries
The complexity of a hash scan is highly influenced by the size of the data warehouse. The hash function may use part of a record or the entire record as the key to generate the hash value. The parameters of hash-based search are the selectivity and the cardinality of the given query. For shift XOR with a uniform distribution, if the key has n values, the probability density function (pdf) is:
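To make the hashing step concrete, here is a minimal sketch of a generic shift-add-XOR string hash, the family to which shift-XOR functions belong. It is not the MMH function itself: the shift amounts, the seed, the 32-bit mask, and the names `shift_xor_hash` and `bucket_counts` are assumptions made for illustration.

```python
def shift_xor_hash(key: bytes, table_size: int, seed: int = 0,
                   left: int = 5, right: int = 2) -> int:
    """Generic shift-add-XOR hash reduced to a table slot.

    left/right shift amounts and the 32-bit width are illustrative choices.
    """
    h = seed & 0xFFFFFFFF
    for byte in key:
        # Fold each octet in with a left shift, a right shift and an XOR.
        h ^= ((h << left) + (h >> right) + byte) & 0xFFFFFFFF
    return h % table_size


def bucket_counts(keys, table_size: int):
    """Count how many keys fall into each slot of the table."""
    counts = [0] * table_size
    for key in keys:
        counts[shift_xor_hash(key, table_size)] += 1
    return counts


if __name__ == "__main__":
    keys = [f"orderkey-{i}".encode() for i in range(10_000)]  # hypothetical keys
    counts = bucket_counts(keys, 64)
    print(min(counts), max(counts), sum(counts) / len(counts))
```

If the hash spreads n keys uniformly over m slots, each slot is expected to hold about n/m keys, so the gap between the smallest and largest count gives a quick view of how far a data set departs from that assumption.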

B. Hash Scan for Complex Queries
Assume the given relation has multiple attributes stored in the CS architecture. Let AK be the length of the attribute, LID the length of the tuple identifier or primary key, and MROW the matched row of the second segment.
The number of seeks for a given query is expressed as:

As can be seen from Table 1 and Table 2, the proposed algorithm performs better for the TPC-H schema.

VI. RESULT ANALYSIS
The proposed algorithm performs uniformly and efficiently, independently of data size. From experiments with large sets of keys, we observed that with a poorly chosen hash function, performance can deteriorate markedly as the number of keys increases (Figure 1). The experimental values of the expected load search time (LST) vary significantly between runs. When we chose a random set of TPC-H schema keys, the distribution of LST values was even narrower. MMH improves the average LST by 30% on Red Hat Linux with a 2.4 GHz Intel processor and 1 GB of RAM (Figure 2). To our knowledge, these are the first experiments testing these predictions.

VII. CONCLUSION AND FUTURE WORK
The proposed algorithm is a generic search algorithm for CS data storage. It is designed specifically for use in query-intensive environments. A key design principle of MMH is to improve throughput by minimizing disk seeks. To achieve this, we used a hash function of the shift-XOR class. We experimentally demonstrated the performance gain of MMH. The continued evolution of hard disk technology should make such performance advantages clearer in the future. The most obvious avenue for future work is to extend the MMH algorithm to multiple instances of CS. The most significant question to address when extending MMH to a multi-instance environment is how to synchronize disk seeks across instances.

Figure 1. Result Analysis for Transaction Query Time

Figure 2. Result Analysis for Transaction Load Time

End of main */ CSXOR(Heap h, const char v)
Number of seeks required to retrieve tuples from the scanned segment. To support the hypothesis, we experimentally evaluate MMH on real data sets, i.e., the TPC-H schema. In our experiments, we focused on certain table sizes and load factors to allow comparison with the original algorithm. We first investigated the average search lengths for successful and unsuccessful searches. The MMH results are compared to those of Jenkins' algorithm (Table 1 and Table 2).
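The average search lengths mentioned above can be computed from bucket occupancy alone. The sketch below assumes a hash table with separate chaining and counts one key comparison per chain element visited; the text does not state which collision strategy was used, so this choice and the name `average_search_lengths` are assumptions.

```python
def average_search_lengths(chain_lengths):
    """Return (load factor, avg successful length, avg unsuccessful length).

    chain_lengths[i] is the number of keys stored in bucket i.
    Assumes separate chaining and equally likely lookups.
    """
    num_buckets = len(chain_lengths)
    num_keys = sum(chain_lengths)
    load_factor = num_keys / num_buckets
    # The k-th key in a chain (1-based) costs k comparisons to find,
    # so a chain of length n contributes 1 + 2 + ... + n in total.
    total_successful = sum(n * (n + 1) // 2 for n in chain_lengths)
    successful = total_successful / num_keys if num_keys else 0.0
    # A miss walks the whole chain of the bucket it lands in.
    unsuccessful = num_keys / num_buckets
    return load_factor, successful, unsuccessful


if __name__ == "__main__":
    print(average_search_lengths([2, 0, 1, 1]))
```

A skewed occupancy such as `[4, 0, 0, 0]` has the same load factor as `[1, 1, 1, 1]` but a longer average successful search, which is how data distribution shows up in this measure.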