Skip to content

bryanherger/vertica-datasketch

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

38 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Vertica DataSketches

This repository contains C++ bindings between Apache Datasketches library and Vertica Database. Datasketch functions are installed as UDX in Vertica so they can be run as SQL.

Details on the library and underlying algorithm can be found at https://datasketches.apache.org/

This extensions uses the open-source C++ implementation from https://github.com/apache/datasketches-cpp

Currently the theta sketch, Hll (HyperLogLog) sketch, and frequency sketch are implemented for Vertica, see examples below.

LogLogBeta is built into Vertica as approximate count distinct and synopsis aggregate functions. It's possible to run COUNT DISTINCT, save the result as a synopsis, and perform counts and merges on the synopsis. Please see the documentation at https://www.vertica.com/docs/latest/HTML/Content/Authoring/AnalyzingData/Optimizations/OptimizingCOUNTDISTINCTByCalculatingApproximateCounts.htm

Install

This library requires cmake 3.14+ "yum install cmake3" package should install the correct version, though you may have to build from source following this Gist: https://gist.github.com/1duo/38af1abd68a2c7fe5087532ab968574e. Then run:

mkdir build
cd build
cmake ../SOURCES
make

Additional build options can be enabled by running ccmake.

To install, copy the library and SOURCES/install.sql to a Vertica node. Edit install.sql and copy the correct library path and file name at the top, then run with vsql -f install.sql

Examples

Theta sketch and Hll sketch used to estimate cardinality. Frequency sketch is used to estimate most common items. Consider the following table and sketches:

dbadmin=> select * from freq order by v1;
 v1
----
 a
 a
 a
 b
 b
 c
(6 rows)

dbadmin=> select theta_sketch_get_estimate(theta_sketch_create(v1)) from freq;
 theta_sketch_get_estimate
---------------------------
                         3
(1 row)

dbadmin=> select hll_sketch_create(v1) from freq;
 hll_sketch_create
-------------------
                 3
(1 row)

dbadmin=> select frequency_sketch_create(v1) from freq;
 frequency_sketch_create
-------------------------
 [[a,3],[b,2],[c,1]]
(1 row)

Theta sketches also support set operations: intersection, union, difference (as a_not_b). Consider the following tables and examples:

Table setA, varchar field v1: a,b,c,d,e
Table setB, varchar field v1:     c,d,e,f,g

dbadmin=> select theta_sketch_get_estimate(
    theta_sketch_union(theta_sketch_create(setA.v1),theta_sketch_create(setB.v1))
) from setA, setB;
 theta_sketch_get_estimate
---------------------------
                         7
                         
dbadmin=> select theta_sketch_get_estimate(
    theta_sketch_intersection(theta_sketch_create(setA.v1),theta_sketch_create(setB.v1))
) from setA, setB;
 theta_sketch_get_estimate
---------------------------
                         3
                         
dbadmin=> select theta_sketch_get_estimate(
    theta_sketch_a_not_b(theta_sketch_create(setA.v1),theta_sketch_create(setB.v1))
) from setA, setB;
 theta_sketch_get_estimate
---------------------------
                         2

Known issues

Vertica sends data to UDX in blocks, and depending on the type and size of the field(s) you are using in sketches, the block size may be too small for efficient batching, resulting in long UDX calls.

You can increase the block size to send more records per batch as follows:

ALTER SESSION SET MaxDesiredEEBlockSize = 25165824;

It's recommended to set exact MB sizes (this example is 24 MB), and realize that more is not always better, I've found that 16-48 MB blocks are usually best.

About

No description or website provided.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C++ 94.1%
  • CMake 5.9%