The abstract of a new paper from a team at LinkedIn (Roshan Sumbaly, Jay Kreps, Lei Gao, Alex Feinberg, Chinmay Soman, Sam Shah):
Current serving systems lack the ability to bulk load massive immutable data sets without affecting serving performance. The performance degradation is largely due to index creation and modification as CPU and memory resources are shared with request serving. We have ex- tended Project Voldemort, a general-purpose distributed storage and serving system inspired by Amazon’s Dy- namo, to support bulk loading terabytes of read-only data. This extension constructs the index offline, by leveraging the fault tolerance and parallelism of Hadoop. Compared to MySQL, our compact storage format and data deploy- ment pipeline scales to twice the request throughput while maintaining sub 5 ms median latency. At LinkedIn, the largest professional social network, this system has been running in production for more than 2 years and serves many of the data-intensive social features on the site.
Read or download the paper after the break.
CREATE INDEX idx ON TABLE tbl(col_name) AS ‘Index_Handler_QClass_Name’ IN TABLE tbl_idx;
As to make pluggable indexing algorithms, one has to mention the associated class name that handles indexing say for eg:-org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler
The index handler classes implement HiveIndexHandler
CREATE INDEX index_name
ON TABLE base_table_name (col_name, …)
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, …)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, …)]
[ ROW FORMAT …] STORED AS …
| STORED BY …
[COMMENT “index comment”]
The index can be stored in hive table or as RCFILE in an hdfs path etc. In this case, the implemented index handler class usesIndexTable() method will return false.When index is created, the generateIndexBuildTaskList(…) in index handler class will generate a plan for building the index.
It only stores the addresses of HDFS blocks containing that value. The index is stored in hive metastore FieldSchema as _bucketname and _offsets in the index table.
See the code from CompactIndexHandler,