Managing big data? Say ‘hello’ to HP Vertica

Over the past few months, I’ve seen an increase in the following use case while working on performance and schema review engagements:

I need to store exponentially increasing amounts of data and analyze all of it in real-time.

This is also known simply as: “We have big data.” Typically, this data is used for user interaction analysis, ad tracking, or other common click-stream applications. However, it can also be seen in threat assessment (DDoS mitigation, etc.), financial forecasting, and other applications. While MySQL (and other OLTP systems) can handle this to a degree, it is by no means their forte. Some of the pain points include:

  • Cost of rapidly increasing, expensive disk storage (OLTP disks need to be fast == $$)
  • Performance decrease as the data size increases
  • Wasted hardware resources (excess I/O, etc)
  • Impact on other time-sensitive transactions (i.e., the OLTP workload)

While there are many approaches to this problem – and oftentimes the solution is actually a hybrid of many individually tailored components – a solution that I have seen more frequently in recent work is HP Vertica.

At the 30,000-foot level, Vertica is built around the following principles:

  • Columnar data store
  • Highly compressed data
  • Clustered solution for both availability and scalability
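
To make the first of those points concrete: in a columnar store, a query reads only the columns it references instead of entire rows. A rough illustration (the table and column names below are hypothetical):

-- A columnar engine answering this query scans only the user_id and
-- event_date columns of the (hypothetical) clickstream table,
-- never the full rows.
SELECT user_id, COUNT(*) AS events
FROM clickstream
WHERE event_date >= '2014-01-01'
GROUP BY user_id;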

Over the next few weeks, I’ll discuss several aspects of Vertica including:

  • Underlying architecture and concepts
  • Basic installation and use
  • Different data loading techniques
  • Various maintenance/operational procedures
  • Some comparisons vs. traditional OLTP (MySQL) performance
  • Some potential use-cases
  • Integration with other tools (such as Hadoop)

While Vertica is by no means a silver bullet that will solve all of your problems, it may prove to be a very valuable tool in your overall approach to managing big data.

20 Comments
Adrian Oprea

For those interested in how Vertica works and how it can be used (scripts, how-tos, and other material), feel free to visit http://www.verticablog.com, http://www.aodba.com/main_tutorials.php?page=vertica&id_tit=22, or http://www.aodba.com/main_articles_single.php?art=9999&page=vertica&id_tit=6
– that’s where I write my day-to-day stuff about how I work with Vertica.

David Chen

The biggest advantage of Vertica is its raw speed. It is extremely fast compared to other analytical databases, and it boasts features that make joins extremely fast. The trade-offs are the high license costs, slow updates, and Vertica’s insatiable need to eat through tons of disk space.

Foo

In future posts, it might be interesting to compare it with InfiniDB or Infobright.

Adrian Oprea

David Chen, you are right that Vertica is “simply fast,” but I have to disagree that it has an expensive license cost and that it is a disk space hog!
License – it is priced per TB of raw data loaded into the database; once inside, you can replicate the data a million times and it will still fall under the 1 TB license, and you can delete the unwanted (old/archive) data and free the space (unlike other big data solutions like Splunk).
Disk space – totally untrue – one of the key assets of Vertica’s architecture is that it takes a mass of data and applies all types of encoding (such as RLE) and compression so that the footprint is as small as possible. I can tell you that, depending on your data types, your disk usage can be 90% smaller (this is one of the keys that make Vertica so fast – fewer blocks to retrieve), and Vertica queries the encoded data with no extra CPU cost.
Slow updates – that is true if you compare it with an OLTP database system. Vertica will go nuts if you do updates (more than ~22 per second). Why? Quick answer: Vertica never overwrites its data files, so every update triggers a new storage write.
One thing I love about it is that you can use any hardware you have, unlike Teradata, Exadata, etc., which come with their own hardware and gadgets.
I work with Oracle Exadata as well, and I can tell you that a machine with 4 GB of RAM and 4 CPUs will load and query data faster than a quarter-rack Exadata machine, and I don’t even use SSD drives on my Vertica hosts (this is only my opinion, from my experience).
Personally, I love Vertica because it is so fast, though it can be tricky and still has a lot to develop in many ways.
One thing for all who might be thinking of using Vertica: it is not meant to replace your MySQL, Oracle, SQL Server, etc. OLTP database; Vertica is there to do the heavy lifting and help you do the analytics with less expense (time, money).
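
As a side note on the raw-data license point: Vertica ships built-in functions for checking license usage, so you can see how much raw data you are being charged for. A minimal sketch (availability and output format may vary by Vertica version):

SELECT AUDIT('');                -- estimate the raw (uncompressed) size of the whole database, in bytes
SELECT GET_COMPLIANCE_STATUS();  -- summary of license utilization against the licensed limit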

Norbert Krupa

David Chen — Could you please back up what you mean by “Vertica’s insatiable need to eat thru tons of disk space”?

Foo — Infobright and InfiniDB are niche players compared to leaders such as Vertica.

Cristian B

Vertica is $100,000 per TB of data (measured on the raw, uncompressed data). Definitely not for small players.

Norbert Krupa

Cristian B — Do you have a source for your figure?

Adrian Oprea

Imagine Facebook or Zynga running their petabyte DW on top of software licensed at $100,000 per TB! I don’t think so!
Talk to your HP sales representative and get the real numbers!

Fadi

Hi Adrian,

I wonder if that highly compressed data is really an advantage. Wouldn’t that mean more load when retrieving it? After all, there is no point in storing the data if you don’t intend to retrieve it at some point.

Adrian Oprea

Hi Fadi,
Well, that is the key to a fast result set!
Like I said, we have both compressed data and encoded data. Compressed data requires some extra CPU cycles when retrieved, but most of the time Vertica uses encoding, as I describe in this article: http://www.aodba.com/tut_output_mysql.php?tut=6&page=vertica – encoding creates a smaller footprint, and by doing this, data retrieval is faster.
Also, maybe your concept of columnar databases is not up to date: columnar storage mixed with Vertica’s technology (projections) makes it so that only the data within the predicates’ reach is retrieved, unlike an ordinary RDBMS, which, unless an index is present, needs to read all the blocks to get the results.
I hope I was clear 🙂

Adrian Oprea

Hi Fadi,
Just wanted to complete the previous post with a simple example:
We have the table EMPS with 46,000,000 rows (1.2 GB of raw data).
Example of the data content:
1,Torres,F,FR,Brainbox
2,Ortiz,F,VG,Mudo
…
Once I load it into Vertica and apply the encodings (I chose the best ones for my data types), the entire 46 million rows of data occupy only 1.1 MB of space on my disk!
See the table definition:
CREATE PROJECTION emps_tt
(
  id           ENCODING RLE,
  last_name    ENCODING RLE,
  gender       ENCODING RLE,
  country      ENCODING RLE,
  company_name ENCODING RLE
)
AS
SELECT id,
       last_name,
       gender,
       country,
       company_name
FROM public.emps
ORDER BY gender,
         country,
         last_name,
         company_name,
         id
UNSEGMENTED ALL NODES;

And the used space:
dbadmin=> select row_count, cast((used_bytes/1024/1024) as float(2)) as MB from v_monitor.projection_storage;
 row_count |  MB
-----------+-------
  46000000 | 1.172
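
(For completeness, the load itself can be done with a plain COPY statement – a minimal sketch, where the file path and delimiter are assumptions:)

COPY public.emps FROM '/home/dbadmin/emps.csv' DELIMITER ',' DIRECT; -- DIRECT writes straight to disk storage (ROS), which suits big batch loads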

Now let’s do this in Oracle (just as an example) – this is an Exadata machine:
– We loaded the data quite fast 🙂

select
  segment_name table_name,
  sum(bytes)/(1024*1024) table_size_meg
from dba_extents
where segment_type = 'TABLE'
  and segment_name = 'EMPS'
group by segment_name;

TABLE_NAME  TABLE_SIZE_MEG
----------  --------------
EMPS                  1516

Table definition:

create table emps (
  id int,
  last_name varchar(30),
  gender char,
  country varchar(3),
  company_name varchar(30)
);

So at the end of the data load we have 1516 MB of used space in Oracle and 1.1 MB in Vertica – that is quite a difference in space consumption, don’t you think? (And just to remind you, no indexes were created on the Oracle table!)
Again, the space consumption ratio depends on your data types (of course, I used data that helped me make my point).
If I am wrong in any way, please feel free to correct me.

Henrik Nordvik

@Adrian Oprea: Since you used compression in Vertica, you should compare against Oracle with compression.
Since you have Exadata available, can you test with Hybrid Columnar Compression?
Use “compress for archive high” (highest compression) or “compress for query low” (lowest compression) in your create table statement.
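
For example, reusing the same table definition (the emps_hcc name here is just a hypothetical variant for the test):

create table emps_hcc (
  id int,
  last_name varchar(30),
  gender char,
  country varchar(3),
  company_name varchar(30)
) compress for query low; -- Exadata Hybrid Columnar Compression, lowest level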

Adrian Oprea

Hi Henrik,
I am so happy you asked this. You see, as I said, it is not compression, it is encoding, and it comes by default if you run the Database Designer (a built-in Vertica tuning tool – or at least that’s what I call it), unlike Oracle, where you need to specify it.
Just to make it clear: I love the Oracle database and I think it holds an important role in the market, and Vertica is not here to replace any RDBMS. That is the key to great system architecture – make the best of each technology (they each serve a purpose).
I will try to apply the compression types and see the results. Just to remind you that any compression requires decompression, so more CPU, etc. Encoded data is read as it is!
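
(If you want to verify which encodings the Database Designer actually picked, the catalog exposes them – a minimal sketch; exact catalog column names may differ slightly between Vertica versions:)

select projection_name, table_column_name, encoding_type
from v_catalog.projection_columns
where table_name = 'emps';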

Vadim Tkachenko

Pricing is an interesting question, and it is always a source of speculation, since you can’t find prices on the Vertica website.
$100,000 per TB is not taken from nowhere: if you google “vertica price per terabyte”, every link mentions $100K/TB.
I guess it comes from the ex-CEO interview at https://gigaom.com/2013/11/02/ex-vertica-ceo-hadoop-is-pulling-the-rug-from-under-the-database-industry/

More recent slides mention a significant price decrease, to $20K/TB:
http://www.slideshare.net/sasindia/future-of-analytics-is-here-15616679

Of course, neither of these can be a reliable source, so the only way to know is to call Vertica sales and generate one more lead for them.

Neo

We recently bought a 12 TB Vertica license for $85K after a few rounds of negotiations. Hope this helps.

Adrian Oprea

Just to shed some light on the Vertica licenses:
– All development, staging (homologation), and disaster-recovery sites are covered under the initial production license – no extra money.
– Raw-data-size based (valid to store up to some amount of raw data) – the license applies only to the raw data loaded; once loaded, it can be replicated as many times as you like.
– Term-based (valid until a specific date).
– Both term-based and data-size based.
– Unlimited duration and data storage (perpetual).
– The Community edition will only allow you to create a 3-node cluster with a max of 1 TB (no backup available 🙁 – and other things are not possible either).
And yes, the price may vary by location and demand – I think it’s all about the way you negotiate.

I am not an HP Vertica rep, so I have no intention of convincing anybody.

Norbert Krupa

Neo, that’s a great price ($7k/TB). What coupon code did you use? VERTICAROCKS?

Dmirty I.

Better off with free InfiniDB! 😉

Adrian Oprea

Maybe for your personal projects!
I would like to see you dealing with petabytes of corporate data using open source software.

Levi Brereton

This is a nice blog!
I got a lot of information from it. On the topic of Vertica monitoring: it is about knowing who is able to access your database, why, and when, and there are strategies for keeping all of that data secure.