Vertica

Vertica Management Tools

This blog post in my Vertica series is about the various management tools available to a Vertica user.  We briefly surveyed some of these tools in previous posts.  If you've been following this series we now have Vertica installed, we've installed the VMart database using admintools, and we've installed the Management Console GUI tool.  Let's cover these tools and a few others in greater detail.  

admintools

I already briefly covered admintools when we looked at Creating a Vertica Database.  The Administration Tools provide a GUI-like experience (on Linux you can get a GUI-like appearance...if you don't mind your tool looking like WordPerfect 5.1 for DOS with no mouse support) without requiring a windowing manager like KDE or Gnome.  Most Linux folks hate running GUIs on servers...it's just more stuff that will crash the OS and drain performance.  Admintools can be launched from a shell (ssh, telnet, xterm) by running the following commands:

  1. Ensure you are logged in as dbadmin.  If you attempt to use your root-level account you'll most likely get an error.  
  2. /opt/vertica/bin/admintools

However, admintools itself is scriptable.  Running admintools -a will show you everything you can do with admintools.  Here are some interesting things you can script (a short sketch putting them together follows this list):  

  1. create a new database (perhaps as part of a CI loop):   /opt/vertica/bin/admintools --tool create_db -s (nodeIPs) -d <dbname> 
  2. start a database:   /opt/vertica/bin/admintools --tool start_db -d <dbname>
  3. connect to a database (this essentially starts vsql):   /opt/vertica/bin/admintools --tool connect_db -d <dbname>
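
For example, a CI job might bounce a scratch database with a few lines of shell.  This is only a sketch: the IP address, database name, and password are hypothetical placeholders, and you should run admintools -a to confirm the exact switches for your Vertica version.

# create, start, and later tear down a scratch database on a single node
/opt/vertica/bin/admintools --tool create_db -s 192.168.0.10 -d CIDB -p 'MyPassword'
/opt/vertica/bin/admintools --tool start_db -d CIDB -p 'MyPassword'
/opt/vertica/bin/admintools --tool stop_db -d CIDB -p 'MyPassword'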

Management Console (MC)

I already covered quite a bit on MC in Installing Vertica Management Console.  MC doesn't do everything admintools does (just like SSMS doesn't do everything TSQL does) but it is a great GUI tool where GUI tools are really needed and aren't cumbersome...such as for detailed performance monitoring graphs.  

Database Designer (DD)

Database Designer can run from MC, admintools, or can be scripted stand-alone.  It can be run by admins all the way down to basic reporting users.  A report developer may want to run this, for instance, to ensure that a new query has the correct projections available for performance reasons, just like a good SQL Server developer would ensure that their query plans are using the correct indexes.  DD functionality will be different based on the login.  For instance, you may be able to create a new design, but not execute it.  DD's purpose is to analyze your logical schema, sample data, and your sample queries and then create a physical schema design (basically projections) that can be deployed or scripted.  These designs are optimized for column-oriented storage.

Your designs can create new projections for your tables based on new sample data or you can pass it queries and it will create designs for you based on that workload. The deployment scripts can even be designed on a dev instance that is k-safe=0 and deployed on a higher k-safety instance.  

In the next post we'll take a deeper dive into Database Designer.  

vsql

vsql is Vertica's shell-based query tool.  It's functionally similar to osql or sqlcmd in SQL Server, or sqlplus in Oracle.  It does not run on Windows.  If you need a query tool for Windows the suggestion is to use ssh or a third party querying tool.  There is also no GUI-based querying tool such as SSMS.  This is a foreign concept to SQL Server DBAs but is a standard practice for most other RDBMSs.  For instance, Oracle shipped no GUI-based query tool until Enterprise Manager in 8.1.  Prior to that most people used Toad.  Here's how you use ssh and vsql.  

  1. connect to your server with ssh as dbadmin or an equivalent user
  2. /opt/vertica/bin/vsql
  3. enter your VMart password
  4. use \q to quit
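
vsql can also be run non-interactively, which is handy for scripts and cron jobs.  Here is a rough sketch; the host, file names, and flags are assumptions on my part (vsql is modeled on psql), so check vsql --help for your version:

# run a single statement and exit
/opt/vertica/bin/vsql -h 192.168.0.10 -d VMart -U dbadmin -c "SELECT version();"

# run a script file and write the results to an output file
/opt/vertica/bin/vsql -h 192.168.0.10 -d VMart -U dbadmin -f my_report.sql -o my_report.out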

Even though you can't install vsql on Windows you can install it quickly on another Linux box.  

  1. Ensure your Linux box is the same version as your Vertica box...same distro...etc.  
  2. scp <remotehost>:/opt/vertica/bin/vsql .
  3. ./vsql

You can also download the vertica .deb file and install it, WITHOUT running install_vertica.  

There are many other management tools such as install_vertica and update_vertica that we've covered already.  There is also vbr.py which is the native backup utility we'll cover in a future post.  

One tool I wish Vertica had was a graphical, web-based querying tool integrated directly into MC.  Unfortunately this doesn't exist yet.  In a future post I'll show you my favorite 3rd party open source tool that covers this shortcoming and makes the Vertica developer's life much easier.  

Scrutinize and Diagnostics

/opt/vertica/bin/scrutinize and /opt/vertica/bin/diagnostics

These are both diagnostic tools with similar goals, and I would've merged them into one general diagnostic reporting tool.  They are roughly analogous to pssdiag in SQL Server.  They both collect a broad range of information from all nodes in the cluster.  The files are zipped and can be uploaded automatically by setting a switch in the call.  


You have just read "Vertica Management Tools" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  


Installing Vertica Management Console

In the previous post in my Vertica series we created the sample VMart database that HP provides.  By doing this we were able to do rudimentary tasks using AdminTools.  In this post we are going to install Management Console (MC) which is HP's web-based GUI tool.  Unlike AdminTools, which runs on the Vertica instance and is invoked via a telnet or ssh session, MC is a web server that can be invoked by any browser with network connectivity to the Vertica cluster.  

Unlike SQL Server, where the SSMS GUI is not required (all SSMS tasks can be carried out via TSQL commands), there are some tasks that only MC can handle and some that only AdminTools can perform.  And then there are some tasks that both tools can handle but that are much easier in MC (the Database Designer being one).  MC can graphically manage the layout of your schema for you; AdminTools cannot.  

In Linux when you REALLY need a GUI admin tool the solution is usually to run an httpd process somewhere that accepts requests from a client web browser and constructs the necessary Linux commands under-the-covers.  That's exactly how MC works.  Almost anything you do in MC can be scripted to SQL commands and saved for later.  

One MC instance can manage many Vertica clusters, each with different security, k-safety, etc., just like SSMS.  Your MC version must match your Vertica version, however.  Your MC users can be different from your database users and you can provide fine-grained permissions to various features in MC.  You can install MC before or after you install your first Vertica node.  MC will auto-discover your existing db clusters if you install it on an existing cluster node.  This is why I've waited until now to cover MC installation.  

MC uses ports 22, 5444, and 5450 for cluster communications.   5450 is used for browser communications over https.  

Installing MC

  1. From your laptop  (remember, Ubuntu server has no GUI) login to my.vertica.com (accts are free) and download the Vertica console .deb package for your OS and processor architecture (for instance,  vertica-console_7.0.2-1_amd64.deb).  
  2. My advice is to install MC on one of the nodes of your cluster.  This isn't necessary and in a prod environment may not even be a smart thing to do.  So far we have only one node of our development cluster so we will install directly on that.  When you do install on a cluster node Vertica will recognize that fact and auto-discover that node and add it to MC automatically.  
  3. I assume you are connecting to your Ubuntu box using PuTTY with SSH. If so you can use pscp to copy your downloaded .deb file to your server. You can google to download pscp if you are running stand-alone PuTTY.
  4. Open a DOS prompt (elevated) and cd to the location of pscp
  5. pscp <location of .deb file> <user>@<ip>:<path on ubuntu>
  6. Login to your Ubuntu server as a sudoer (not dbadmin, which is not a superuser)
  7. sudo dpkg -i <pathname> 
  8. be patient.  This is installing misc java files and any missing dependencies.  
  9. Eventually you will be asked to open https://ip:5450/webui to complete the installation.  Leave your terminal/ssh session open, do not close it.
  10. You should first be presented with a license screen in your browser.
  11. In step 2 you will provide a "superuser" MC user.  My advice is to just use the default, which is dbadmin from the underlying Vertica installation on your node.  
  12. You will then pick an authentication mechanism for non-superuser access to MC.  I'm not even going to attempt configuring LDAP.  
  13. At this point you likely won't see anything happening in your browser window.  Don't refresh or close your browser.  If you switch back to your terminal session you can watch config messages scroll through the screen while Vertica builds the MC application given your configuration preferences.  
  14. Eventually the terminal window will flash a message that the installation is complete and will give you the URL to log in to.  At this point you can close the terminal session and the configuration browser window. 
  15. Open the URL and you should see a login screen.  Log in with the superuser account that you previously created.  
  16. You should now be at the MC home page.  
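
Putting steps 3 through 7 together, the copy-and-install sequence looks roughly like this (the file name, user, IP address, and paths are placeholders).  From the elevated DOS prompt on your laptop:

pscp C:\Downloads\vertica-console_7.0.2-1_amd64.deb youruser@192.168.0.10:/tmp/

Then from your PuTTY/ssh session on the Ubuntu server, logged in as a sudoer:

sudo dpkg -i /tmp/vertica-console_7.0.2-1_amd64.deb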

 

Using Management Console

You can learn a lot about Vertica and MC by just clicking around.  "MC Settings" allows you to modify the config values you specified above.  This is independent of any clusters you may wish to manage using MC.  

The first task you'll want to do is import your existing (one node) cluster.  Click "Database and Clusters" and then Import.  Since (if) you installed MC onto an existing cluster node the database will be auto-discovered.  You simply have to supply your username/password.  If you supply your dbadmin password you will be able to manage all aspects of your cluster.  If you only have a limited user account (say, because you only need access to Database Designer) then add that account here and you will see only the functionality that account has permission to use.  

You should receive a success message.  You can now discover MC and lots of Vertica features on your own.  

I'll point out some interesting MC features that are worth exploring next.  

Message Center

The Message Center will display errors (like the Error Log in SSMS) as well as alert conditions.  In my case it shows that my cluster is not k-safe (we'll fix that in a later blog post).  One interesting feature is that you have the ability to mark messages as "read", similar to email.  You can basically dismiss messages that you don't wish to see again.  This cuts down on the noise messages.  What is slightly confusing to noobs is how to "expand" a message to read the details.  You use the triangle to the left of the priority indicator.  

Database Overview Page

You access this page by clicking on the "Recent Databases" option for the cluster you just configured.  

You will be presented with the dashboard/overview page.  

There are tabs along the bottom of the page to access additional functionality.  Most interesting is the Design Page which launches the graphical Database Designer wizard.  We'll cover this in detail in a future blog post.  In a nutshell, this utility lets us tweak our physical designs for performance and k-safety.  

Finally, the Explain tab allows you to execute SQL statements and view performance metrics.  I quickly typed a simple query and generated an EXPLAIN plan.  

An EXPLAIN plan is equivalent to "Display Estimated Execution Plan" in that the query is not executed but an approximate query plan is returned.  The PROFILE option actually executes the query and shows detailed CPU and memory usage.  
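
For example, from vsql you simply prefix a statement with EXPLAIN or PROFILE (the Explain tab does the same thing graphically).  The query below is a made-up VMart example:

# estimated plan only...the query is not executed
/opt/vertica/bin/vsql -d VMart -U dbadmin -c "EXPLAIN SELECT COUNT(*) FROM store.store_sales_fact;"

# actually executes the query and collects CPU and memory usage
/opt/vertica/bin/vsql -d VMart -U dbadmin -c "PROFILE SELECT COUNT(*) FROM store.store_sales_fact;"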

You can find interesting VMart queries in /opt/vertica/examples/VMart_Schema.  Use nano to examine these files or copy them to MC and look at the EXPLAIN plans.  

We'll cover some additional management tools in the next post.  


You have just read "nodetitle" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  


On Certifications

"Should I get a certification?"

That's a common question in IT and my opinion on certifications has changed over the years.  I recently got my Vertica Certification and in the process I've been re-evaluating the value I find in certifications.   

My First Certification

...was my Microsoft Certified Professional (MCP) in 1997.  I worked at a large "shop from home" television channel as a help desk guy.  The company allowed me to take a Windows 95 class and they paid for the certification.  I already had my B.A. and I didn't find a lot of value in having an MCP.  

Microsoft Certified Trainer

I didn't stay at that company for long.  I went to college to be a high school teacher but realized I hated teaching kids.  So I thought I could teach adults SQL Server, databases, and programming...something I did to help put myself through college.  I was hired by a MS Gold Partner and soon became an MCT.  Within a matter of a few months I held my MCSE and was working toward my MCSD.  Being an MCT for various products requires that you pass the exam for those products.  This usually meant learning just enough to pass the test so I could teach classes and earn my pay.  My breadth of knowledge was diverse, but had very little depth.  

As the aphorism goes, "those who can't do, teach".  I was teaching Exchange Server after just barely passing the exam the week previous.  As long as I stuck to the courseware I was fine and students loved me.  I even won "Instructor of the Month" for all of the United States in April 1999.  But if I veered from the courseware, or a student asked a moderately difficult question, I looked foolish.  This really wasn't what I wanted out of my career.  I felt like a shallow fraud.  

"Paper MCSEs"

It was around 1999 when I first noticed the ads on the radio touting, "Get your MCSE and make $70,000 per year your first year.  We can help you!"  These were basically bootcamps where you were given the answers to the exam questions and were allowed to briefly work with the technology.  They had similar bootcamps for Java Jocks.  This was the Dot Com Era after all.  The market demand for technologists was never higher.  This disgusted me.  My training company was also a Prometric Testing Center and I saw a lot of test takers come in to take the tests to get their certifications.  It was common to see people in the parking lot studying their "brain dumps" that they downloaded from the nascent Internet.  TroyTec was becoming hugely popular.  I saw tons of old TroyTecs laying around near the exam area that testers left behind as they were trying to cram just before taking a test.  

These people were "Paper MCSEs" and it put a sour taste in my mouth for a decade regarding certifications.  

One Particularly Disgusting Experience

Every Prometric exam room had a security camera on the wall plainly in sight of every test taker.  Our office administrator had a CCTV sitting prominently on her desk that she used to monitor the examinees.  Obviously this was to discourage cheating.  In fact, if you took an exam you had to sign a little waiver indicating that you would not cheat or sell the questions to a brain dump site.  

I came back from lunch one day and our office admin asked me to look at the TV with her.  Right there, clearly, was a test taker thumbing through a folder of notes looking for answers.  And there was the camera right over his head.  We often saw people cheating...you could see them looking at note cards they had up their sleeves for instance.  The office admin asked me what she should do in a case this brazen.  

My response still sickens me.  

I told her to ignore it.  

If you accuse someone of cheating it can wreck their career.  What if, somehow, he wasn't really cheating?  What if we accuse him of cheating and somehow this blows up in OUR face?  I would prefer not to be involved AT ALL in these scenarios.  As I saw it at the time, if you cheated and got away with it...good for you.  You're a fraud and you won't last long at your job.  

These cheaters were the rule, not the exception, during this era.  I viewed a lot of it.  It disgusted me and jaded my opinion of anyone who had a certification but no real-world experience.   

Networking Essentials

You could successfully cheat for just about any MS exam, but Networking Essentials was the hardest.  It was the widowmaker.  I saw guys take that test 4 or 5 times and still not pass.  "Net Ess", as we called it, required you to actually DO networking.  You couldn't just memorize brain dump answers.  You had to conceptually understand routing.  And you were asked to do math involving subnetting and supernetting.  You even needed to know about "plenum wiring", which I had run for 4 years as an intern in college.  The test was still multiple choice so it was possible to cheat, but it would likely take you a few tries.   

If you wanted to know if an MCSE was of the "Paper" variety, simply ask them how many times it took them to pass Net Ess.  Or ask them a question from the test.  

MCDBA

In 1999 MS released SQL Server 7.  This was a radical departure from SQL 6.5 on many levels I won't rehash here.  I had limited experience with 6 and 6.5 and IMHO the industry scuttlebutt was that SQL Server was only a "departmental" database...not able to scale and too prone to crashing on Windows NT.  With SQL 7 the industry's views seemed to change overnight.  SQL Server was now considered a suitable database for these new e-commerce apps that were becoming popular and mission-critical.  

Shortly after SQL Server 7 was released MS announced the availability of the MCDBA certification.  Since I worked for a Gold Partner I was privy to "inside information".  I learned that the two SQL test requirements (Administration and Design) for the MCDBA wouldn't be quite so easy anymore.  Was Microsoft finally going to have stringent testing?  Would there be no Paper MCDBAs?  Although still multiple choice, the questions would be more scenario-based.  Gone would be pedestrian questions like "Where is the data for a view stored?" 

I begged my employer to let me take the tests before they were GA.  I studied and passed both exams on the first try sometime around August 1999.  I was probably one of the first MCDBAs outside of Redmond.  The tests were not easy and I remember seeing many candidates fail the exams for the next few months.  Around this time MCSEs were everywhere but did not command much of a premium in the market.  But an MCDBA did.  

The Windows 2000 "Upgrade" MCSE

I soon tired of being a technical trainer and became a full-time data professional in late 1999.  It was easy to forget my former career.  

When Windows 2000 was announced MS also announced that the Windows NT MCSE would be retired.  If you held that "version" of the MCSE (which I did) you could no longer call yourself an MCSE without taking an upgrade exam.  This pissed me off.  This was like having my B.A. removed...I earned it; I should get to keep it.  I wasn't using any of the new features of Win2K, so why should I upgrade?  I still felt that there was a bit of value in having the MCSE logo on my resume so I decided I would at least attempt the upgrade exam.  The upgrade exam was available until January 2002 and I scheduled my exam for the last week of December.  I didn't study and I barely passed.  

What really pissed me off is that MS reversed their decision a year later and let everyone keep their MCSE designation even without the upgrade exam.  This kind of attitude from MS is why I'll likely never hold another MS certification again in my life.  They compounded bad decisions upon bad decisions.  I talk to a lot of veteran MCSEs who still are not really sure if they are allowed to use the MCSE logo or not based on these ridiculous rule changes.  

Oracle OCA/OCP

In 2002 I was using Oracle every day designing ETL packages.  My employer sent me to training so I really just needed to pass a few tests to become an OCP.  I passed each exam easily.  But OCP exams are much like MS exams...very pedestrian questions with obvious answers.  I'm competent with Oracle, but frankly my skill level isn't great.  I guess that makes me a Paper OCP.  The fact that I aced every exam kinda indicates to me that an OCP certification, like an MCSE certification, is not a good measure of a potential new hire.  

The worst aspect of the OCP tests is that you don't need to know ANY PL/SQL.  Every day I encounter Oracle DBAs that cannot write even rudimentary PL/SQL packages.  I strongly believe this is because the OCP exams do not stress this.  

SNIA SCSP

It was another 4 years until I got my next certification.  I was working as a data architect at an ISV and our hosted environments were having HUGE I/O problems.  I knew at least the basics of SANs and storage and I/O but I was having problems communicating my observations with the SAN engineers.  They were denying their SANs had any problems...and I knew MOST of our problems were indeed the SAN.  I began to read anything I could about EMC, IBM Sharks, or anything storage-related.  After a few months I was able to articulate, with irrefutable proof, that we had SAN problems.  I even got the SAN engineers to admit it.  It took a few more months before our problems were fixed.  

I wondered if I could pass the SNIA SCSP.  I took the test and aced it the first time.  No one paid me this time, I simply wanted the certification to prove to myself that I knew the technologies as well as I thought I did.  Even though I aced it, I guessed at every question involving tape libraries and NAS...I didn't feel those subjects were germane to a data architect.  But apparently they are very important to the fine folks at SNIA.  

I think the SNIA SCSP is the most valuable certification I have.  It is an industry-group certification, not a vendor certification.  This means that it is vendor-agnostic.  You are tested on concepts and theory vs tools.  With MS and Oracle exams there is clearly a push to have candidates learn the newest features of the newest release, even if those features have a limited applicability to most practitioners.  

Vertica, MySQL, etc

I recently took my Vertica exam and passed, which makes me an HP ATP.  I've been using non-traditional DBMSs a lot lately, especially Vertica.  I had no formal training in Vertica, I merely picked it up by reading and building my own test servers.  I even have a Vertica series of blog posts where I'm disseminating some of the key Vertica concepts that initially tripped me up.  I took the test to prove to myself that I knew Vertica and to determine what areas needed improvement.  In other words, I took the exam for ME.  I'll put it on my resume not because I value the certification, but because I do have valuable Vertica experience.  

Unfortunately I think the Vertica exam is a complete waste of time.  Here's why:  

  1. It is an online Pearson exam.  This means that an examinee can have the test open in one browser and can do google searches for answers in another.  
  2. The questions are about totally esoteric aspects of Vertica such as what screens in Management Console show you CPU utilization.  This is tool-based knowledge.  A better question might be how do you decrease CPU utilization for a given query.  
  3. There are questions clearly geared towards older releases of Vertica.  Terms have changed in the newer releases but the tests have not been updated to reflect these changes.  
  4. There are too many questions about installing Vertica and not enough about performance and k-safety.  In fact, there were no questions on k-safety.  Installing Vertica is a task done, at most, a few times a year.  The latter is done possibly every day.  Further, there were ZERO questions on querying data.  

This is just another vendor exam that focuses too much on the tool and not enough on the concepts.  

Parting Thoughts

I find little value in most certification exams.  Vendor exams are almost worthless.  There is too much emphasis on the newest features which are likely not being used, yet, in most shops.  And there is too little emphasis on foundational knowledge.  Industry certifications (SNIA, PMI/PMP) are much more valuable.  A thorough understanding of the concepts is required, and those concepts are transferable to most vendor implementations.  

I especially like the approach that EMC is taking.  They offer a Data Science Associate certification.  This is a holistic "architect" certification.  You must know a little bit of everything in data lifecycle management.  Design is covered, relational concepts are covered, querying in multiple languages is covered, as well as storage (this is an EMC certification after all).  Even though this is offered by a vendor there is nothing vendor-specific about it.  


You have just read "On Certifications" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

Examining the VMart database using vsql

In Creating a Vertica Database I covered how to install the VMart sample database (similar to AdventureWorks in SQL Server) and populate it with sample data.  In this post we'll take a deeper dive into VMart, how to perform basic queries using vsql (similar to sqlcmd), and examine how data is physically persisted to storage media.  We haven't covered installing Management Console yet so we'll explore VMart using good ol' SQL commands.  We'll cover using GUI tools in a future post.  

The Schema

cd into /opt/vertica/bin and run:

vsql -h <IP address> -d VMart -U dbadmin

We can run a series of commands to explore the VMart schema.  In SQL Server we have sys catalog views (or INFORMATION_SCHEMA views in ANSI-compliant RDBMSs).  The equivalent in Vertica is the V_CATALOG views.  

Here are some useful queries against the V_CATALOG views, along with their purpose and closest SQL Server equivalent:

select * from schemata where is_system_schema = 'f';
    sys.schemas.  This will give you the non-system schemas.  Note that booleans are f and t in Vertica.

select * from all_tables where table_type = 'SYSTEM TABLE' order by table_name LIMIT 10;
    Shows you the first 10 system tables.  Note that Vertica string comparisons are case-sensitive; using 'system table' will generate 0 rows.

select table_name from all_tables where schema_name = 'store';
    Show me all tables in the store schema.

select table_name from all_tables where schema_name = 'public';
    Show me all tables in the public schema (customer_dimension, for example).

select table_name from all_tables where schema_name = 'online_sales';
    Show me all tables in the online_sales schema.

select * from columns where table_name = 'online_sales_fact';
    Similar to sys.columns.

select constraint_name from foreign_keys where reference_table_name = 'date_dimension';
    sys.foreign_keys...lists all foreign keys referencing the date_dimension table.

select constraint_name, constraint_type from table_constraints;
    sys.check_constraints and sys.key_constraints: primary key, foreign key, and check constraints.

This should give you a feel for how to explore the catalog and your database.  

Physical Data Persistence

Every DBMS persists its data to storage media in a different fashion.  In SQL Server you specify the name and location of all data files.  Same with Oracle.  In MySQL you specify a directory and MySQL does whatever it wants with that folder.  Vertica works similarly to MySQL.  I noted in Creating a Vertica Database that you must specify a single data directory for a database's data.  With VMart now loaded we can examine the directory structure.  What you'll notice is that Vertica creates a lot of folders, and within each folder is a series of subfolders and files.  

I couldn't find anything authoritative regarding the make-up of these structures.  But I do know 2 things for sure:  

  1. Once a data file is created it is never updated or modified.  The files, and the underlying data, are encoded and compressed (depending on your design settings) with no "air" in the files for new data to be inserted between existing data (ie, there is no concept of FILLFACTOR).  When rows are DELETEd or UPDATEd (which is a DELETE followed by an INSERT) a "delete vector" file is created that marks which rows should be removed from any result sets.  After a period of time the delete vectors are removed and the original files are written to new files without the deleted data.  
  2. Since files are never modified backups can be performed using any standard file backup utility including rcp (remote copy).  There is no special buffer pool in Vertica...data files are buffered using standard unix mechanisms.  
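
A quick way to see this for yourself is just to list the directories.  The paths below assume the default locations from when we created the VMart database (dbadmin's home folder), so adjust if yours differ:

# the data directory: lots of subfolders full of immutable, compressed files
ls -R /home/dbadmin/VMart/v_vmart_node0001_data | head -40

# the catalog directory: metadata, configuration, and vertica.log
ls /home/dbadmin/VMart/v_vmart_node0001_catalog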

You have just read "Examining the VMart database using vsql" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  


Creating a Vertica Database

This is the next post in my Vertica series.  In a previous post we covered installing Vertica 7.  I finished that post by mentioning that once you have a node installed you can connect and do basic tasks using /opt/vertica/bin/admintools.  If you aren't familiar with Linux then even that simple command can throw you for a loop.  Here is the exact process to start your exploration:

  1. Log out of Ubuntu if needed.  
  2. Log in as dbadmin with your dbadmin password
  3. cd /opt/vertica/bin
  4. ./admintools

You should be presented with an "old-school console GUI".   This will work over telnet or SSH and although it looks kinda like a Windows GUI you won't be able to use your mouse for anything.  You navigate using Enter and the up and down arrows.  You can poke around the menu options but you'll find you can't do anything with Vertica yet because we have not created any databases. 

VMart

 HP's sample database is called VMart. It is a data warehouse for a supermarket chain. Every installation of VMart is a bit different because the data is generated on-the-fly via a data generator. This makes query comparisons between Vertica installations a bit of a hassle vs something like AdventureWorks. HP also supplies a series of sample queries that demonstrate various Vertica features and can also be used for performance tuning with Database Designer. In this blog post we'll cover creating our first database by installing VMart.  

Installing the VMart shell db

  1. Start AdminTools
  2. Enter "Configuration Menu"
  3. "Create Database"
  4. Call it "VMart"
  5. You will need to enter a password.  This is equivalent to "sa" in SQL Server.  
  6. You will need to select the hosts for this database.  As mentioned in a previous post, Vertica nodes are named using their IP addresses.  Since you only have one host currently you should see that host listed for this database.  Click OK.  
  7. Enter the catalog and data file locations.  This will default to the home folder for your dbadmin user.  
  8. You'll get a warning that a database with 1 or 2 hosts cannot be k-safe.  That's fine, we'll fix that in a future post.  Just click OK.  

And that's it.  You now have a shell database akin to the model database in SQL Server.  At this point it is interesting to look first at the catalog and data directories.  You'll end up with a structure similar to this:  

Folder/file layout and purpose:

/home/dbadmin
    Home folder for the dbadmin user.

/home/dbadmin/VMart
    Folder containing your new VMart database.
    dblog: the log file for the spread daemon, which is an open source messaging system not unlike JMS.  Spread is used so individual cluster nodes can communicate with each other.
    port.dat: the port Vertica is listening on.  nano port.dat will show you 5433, which is the default Vertica port.

/home/dbadmin/VMart/v_vmart_node0001_catalog
    This is the catalog folder.  You'll find various error logs and configuration files here.
    vertica.log: this, along with dblog, are the log files for your database.

/home/dbadmin/VMart/v_vmart_node0001_data
    This contains your data.  This folder will have 0 files since we haven't created any data or schema yet.  We'll come back to this later.

This gives you an idea of what the file structures and naming conventions look like with a Vertica database.  

Running the VMart scripts

With the VMart database created we need to create the schema and load the data.  

  1. From the admintools Main Menu choose "2 Connect to Database".  This will launch vsql which is similar to sqlcmd in SQL Server.  We'll cover this in detail in a future post.  
  2. The VMart schema files are located in /opt/vertica/examples/VMart_Schema.  The README file shows you various parameters for generating a smaller or larger sample data set.  You can view any of the files in that folder using nano.  We'll take a look at some of these files in more detail in a later post.  For now we just want to install a base VMart db.  
  3. From vsql:  \i /opt/vertica/examples/VMart_Schema/vmart_define_schema.sql
  4. \q to quit vsql, which will return you to admintools
  5. Exit admintools.
  6. cd /opt/vertica/examples/VMart_Schema
  7. ./vmart_gen :  this will create a small default db.  What it really does is create a bunch of .tbl files in your folder.  These files are pipe delimited text files that can be read using Vertica's COPY command, which is equivalent to SQL Server's bcp command.  
  8. We have the .tbl files that we now need to load into VMart.  Run: vsql -h <IP address> -d VMart -U dbadmin  (this is an alternative method to launching vsql without launching admintools first).  
  9. \i vmart_load_data.sql :  this will load the .tbl files.  This may take some time.  
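
Putting steps 1 through 9 together, here is the same thing as a rough non-interactive script (the IP address is a placeholder and you will be prompted for the dbadmin password on each vsql call):

cd /opt/vertica/examples/VMart_Schema

# create the VMart schema
/opt/vertica/bin/vsql -h 192.168.0.10 -d VMart -U dbadmin -f vmart_define_schema.sql

# generate the pipe-delimited .tbl files
./vmart_gen

# load the .tbl files (vmart_load_data.sql uses COPY under the covers)
/opt/vertica/bin/vsql -h 192.168.0.10 -d VMart -U dbadmin -f vmart_load_data.sql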

In the next post we'll take a look at the VMart schema, run some sample queries, and see how the data files are laid out on disk.  


You have just read "nodetitle" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  


Vertica Physical Architecture

This post is the next in my Vertica series and covers how Vertica is physically architected.  I covered pieces of this when we installed our first Vertica instance.  Now we'll go into a bit more detail.  This is all leading up to installing a second Vertica node and sharding our data in a manner that helps us with both recoverability and performance.  

Instances, Nodes, Clusters, Databases

Only one instance of Vertica can run on a given host. This is also called a node. If a node fails the database will still recover if the cluster has at least 3 nodes. Kinda like RAID 5 for your nodes. HP recommends 4 nodes minimum and more is always better because additional nodes work just like SQL Server's readable secondaries...they are active for querying and updates. They are not just passive cluster nodes. And, they are not necessarily COMPLETE copies of any other node in the cluster, unlike SQL Server.  This means, similarly to the concept of RAID 5 for storage, that you can lose a certain number of nodes without bringing your database down.  Also, additional nodes with partial (or complete) copies of data also gives you a performance boost when querying.  

There is also something called a "control node" in Vertica.  You always have at least one control node but you can have many if your cluster is huge or geographically dispersed.  The control node can also migrate to other nodes automatically if the control node goes down without any DBA intervention.  The control node is the traffic cop.  As queries are submitted to Vertica the control node sends the requests to other nodes (...similar to Map...) to collect data and return it to the control node which then does any final calculations before returning the data to the requestor (...similar to Reduce in MapReduce).  All nodes run a process called "spread" which assigns nodes to be the control node if you haven't configured it yourself. 

K-safety

"K-safety" is how Vertica measures fault tolerance and is the key to understanding the node-to-data relationship. Let K equal the number of replicas of the data in a given cluster. This is not the same as the number of nodes. The more K-safety you have the more nodes can fail without impacting performance or having the db go offline. When a node comes back online it will automatically recover just like a failed disk in a RAID 5 array.  The failed node queries neighboring nodes for missing data.  The recovery status can be queried using the Management Console which we'll install and look at in a future post.  

The "K" can be 0, 1, or 2 and depends on your physical design. You don't want to screw up your design and end up with something that isn't as k-safe as you planned, not to mention not as performant.  The Database Designer (again, we'll cover this in a future post) helps you by noting your design's K-safety level. It offers suggestions on how to improve your k-safety and can even help you convert your existing design to be either more or less k-safe depending on your needs.  

  • K=0 means that if ANY node fails the cluster fails and your database will be marked offline for querying until that node is back online. If there was a hardware problem all you need to do is fix the hardware and upon reboot Vertica will begin the recovery process automatically.  If you need to entirely rebuild the node from bare metal then you MUST ensure the new node has the SAME IP address.  Vertica uses IP addressing, not FQDNs to refer to other nodes.  
  • K=1 means that any single node can be lost and the database will remain up (perhaps with performance degradation). This means that every node's data is "replicated" to at least one other node.
  • K=2 means that your db can remain up if ANY 2 nodes fail. Here's where it gets tricky. If you WANT K=2 then your design must ensure that if any 2 nodes fail, a third node still has your data.

The formula for K-safety is simple: to achieve a K-safety of K you must have AT LEAST 2K+1 nodes.

So, to have a K-safety of 2 (in other words, you want to survive 2 nodes going down simultaneously) you must have AT LEAST 5 nodes. Vertica only officially supports K=2. For argument's sake, if you really wanted K=3 you would need 7 total nodes, minimum.

So, how does data get to the other nodes?

Unlike SQL Server, where distributing data ALWAYS requires DBA intervention (be it replication, log shipping, or readable secondaries), everything is automatic in Vertica.  You simply add a clause to the CREATE PROJECTION statement telling Vertica how to handle the sharding (see the sketch after the list below).  You have 2 choices: 

  1. When the projection is small and unsegmented it is replicated to ALL nodes.  No need to add the overhead of segmentation in this case.  
  2. When the projection is large and segmented then a "buddy projection" is copied to at least one other node.  This is done using offsets.  For example, a fact table might be segmented by a hash function across every node in the cluster; OFFSET 1 indicates that the buddy projection for a given node will be stored on the next neighboring node in the cluster.  
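
Here is a rough sketch of what those two choices look like in DDL.  The table and column names are invented for illustration, and the exact CREATE PROJECTION syntax should be verified against the documentation for your Vertica version:

/opt/vertica/bin/vsql -d VMart -U dbadmin <<'SQL'
-- choice 1: a small lookup projection, replicated in full to every node
CREATE PROJECTION public.date_dim_rep AS
    SELECT date_key, calendar_date
    FROM public.date_dim
    ORDER BY date_key
    UNSEGMENTED ALL NODES;

-- choice 2: a large fact projection, hash-segmented across the cluster, with a
-- buddy copy of each segment on the next neighboring node (OFFSET 1)
CREATE PROJECTION public.sales_fact_seg AS
    SELECT sale_id, date_key, amount
    FROM public.sales_fact
    ORDER BY date_key
    SEGMENTED BY HASH(sale_id) ALL NODES OFFSET 1;
SQL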

All of this k-safety stuff probably sounds complicated and it is easy to screw it up if you go off-the-reservation and do something that the Database Designer did not recommend for you.  Even if YOU think you should be at k-safety = 2, that doesn't mean Vertica agrees with you.  SELECT current_fault_tolerance FROM system; will show you your current k-safety.  If it isn't what you expected just rerun the Database Designer.  
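
Checking is easy from vsql.  The second statement below, MARK_DESIGN_KSAFE, is the function Vertica provides for declaring the K-safety your design is supposed to meet; treat the exact call as something to verify against your version's documentation:

# what does the cluster think its fault tolerance is right now?
/opt/vertica/bin/vsql -d VMart -U dbadmin -c "SELECT current_fault_tolerance FROM system;"

# once the design really is k-safe, declare the level you expect
/opt/vertica/bin/vsql -d VMart -U dbadmin -c "SELECT MARK_DESIGN_KSAFE(1);"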

Fault Groups

None of this K-safety stuff matters if all of your nodes are in the same rack and that rack loses power. This is a "correlated failure" and is a function of externalities. If this happens you certainly can't blame Vertica for your stupidity.  In Vertica you can define your own fault groups to give Vertica hints to understand these correlated failure points better and influence its activities accordingly.  For instance, defining a fault group will also let you smartly define control nodes.  You wouldn't want all of your nodes to be in one rack and your control node to be connected to a different switch in a rack on the other side of your data center.  

When half of your cluster nodes fail, your cluster fails, regardless of your K-safety.  But it's common to have buddy projections on many nodes in your cluster.  If possible it is best if those buddy projections/nodes have as little shared infrastructure as possible.  Say for instance, SAN storage.  If you have multiple SANs (or even VM hosts) you want those buddy projections to be as separate as possible for DR and performance reasons.  Fault groups are the ticket for all of this.  

What is the catalog?

When you install Vertica you must provide catalog and data directories and the path must be identical on all nodes. The catalog directory stores all of the metadata about your database; basically everything except your actual data. The catalog is akin to the PRIMARY filegroup in SQL Server although it contains many files. It holds the system tables, data about your nodes, snapshots, file locations, etc.

The catalog files are replicated to all nodes in a cluster whereas the data files are unique to each node.  Vertica uses lots of files to store its data.  Once a file is written to it is never altered, which makes recovery quite easy.  Vertica simply needs to copy missing files to the recovering node from any other node with the necessary replicated data or buddy projections.  Since files are never altered Vertica has no concept of FILLFACTOR or PCTFREE.  Since files are columnstores it is guaranteed that neighboring data will have the same datatype, therefore Vertica's ability to encode and compress data is absolutely amazing.  As mentioned in the last post, this becomes a challenge when deleting data.  Since files are never altered Vertica uses "delete vectors" which are markers as to which rows in which files should be discarded during query execution.  At certain intervals a background process will rewrite files to purge data that was deleted.  

Summary

This post was a quick overview of Vertica's physical architecture.  Hopefully you learned about some really cool features like buddy projections that guarantee that your data is safe even if you lose a few nodes.  Let's say you are a SQL Server Guy, like most of my readers.  You may be wondering why you would want to learn about Vertica if you have no plans to implement it.  I find it fascinating to learn about competing technologies to help me understand limitations in the solutions I work with that I didn't previously understand.  This makes us all a bit more well-rounded.  In the next post we'll cover installing Management Console which is the GUI for Vertica.  


You have just read "Vertica Physical Architecture" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  


Vertica Objects

It's been a few weeks since I covered any Vertica topics, the last being Installing Vertica 7.  Before we create our first Vertica database it's important to understand the core Vertica objects that we will eventually interact with.  Remember that Vertica data is stored in a column-oriented manner.  It is optimized for read-only data warehouse-style applications.  

Logical Modeling

You do your logical modeling just like you would on any other relational platform.  Vertica is relational, not hierarchical or network-modeled.  From a physical implementation perspective, it has schemas, tables, views, RI constraints, etc., that are all visible to your SQLers.  There's no reason why a good data warehouse schema (star schema) that you would concoct for, say, SQL Server or Oracle, wouldn't logically be the same on Vertica.  

Physical Modeling

The difference between Vertica and other RDBMSs is at the physical modeling layer.  Physical modeling is quite a bit different than what you may be accustomed to.  "Tables" are physical storage objects in the relational world.  With Vertica a "table" is backed by collections of table columns called projections.  Every table must have at least one projection that stores all of the table's columns' data.  This is called a superprojection.  Every table has one and only one superprojection.  

As mentioned above, you have schemas, tables, and constraints.  What you don't have is indexes.  Instead of indexes data is materialized directly into the necessary projections.  

More on Projections

So, remember that projections are really just groups of columns.  Almost everything we do in Vertica revolves around a projection.  

More simplistically, a projection (any projection) is persisted data, usually optimized.  Projections are not always raw data like in an RDBMS table (though the superprojection is).  Instead, a projection is more like a materialized view or a saved result set.  It is optimized for reads and is automatically updated as data changes.  Vertica can encode and compress the data in projections since like data types are always stored next to each other.  A given column's values (the Name "column", say) are physically stored together and share the same data type.  This will encode and compress quite nicely.  

Just like a SQL Server partition or index, you don't query a projection directly (or a superprojection...from now on when I mention "projection" you can assume either a projection or superprojection unless I specify otherwise), you query a table.  Vertica picks the best projection for you just as SQL Server would pick the best index for you.  

Think of a projection just like a SQL Server indexed view or an Oracle materialized view. It is legal to put just about anything in a projection that you would put into an indexed view. You can join tables/projections (but only on PK/FK cols and only as INNER JOINs...these are called pre-join projections), calculations and aggregations (as of a new feature in Vertica 7)...you can even put ORDER BY clauses which you can't generally do in a view in SQL Server. In fact, this is very much preferred.  

When modeling a projection, store columns together that are regularly accessed together to optimize IO.  This is called a column grouping.  

One exception that may seem odd at first is that a projection CANNOT have a WHERE clause.  The reason is quite simple.  Remember that a projection is really just a set of materialized columns.  You can't filter the columns otherwise the projection isn't a true representation of the underlying table.  To overcome this the conventional Vertica wisdom is to ensure that often-used predicates are contained in leading columns of your projections. 
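
As a hypothetical sketch of the above (the table and columns are invented, so check the CREATE TABLE and CREATE PROJECTION syntax against your version's docs):

/opt/vertica/bin/vsql -d VMart -U dbadmin <<'SQL'
-- creating the table automatically gives you a superprojection containing every column
CREATE TABLE public.customer_demo (
    customer_key   INTEGER,
    customer_name  VARCHAR(64),
    customer_state CHAR(2)
);

-- a narrower projection: just the columns we query together, sorted on the column
-- we usually filter on so the predicate lands on the leading sort column
CREATE PROJECTION public.customer_demo_by_state AS
    SELECT customer_state, customer_name
    FROM public.customer_demo
    ORDER BY customer_state;
SQL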

Segmentation

Small projections (lookup tables for instance) are replicated to every node in your cluster. They are not segmented, they are unsegmented projections.

Large projections are segmented (similar to SQL Server partitioning, or maybe the better analogy is "sharding" which SQL Server does not have natively) and the segments are smartly copied to other cluster nodes for redundancy and performance.  

Segmentation is handled by built-in hashing and there are multiple algorithms available to you to do this. The "segmentation key" should have a high cardinality (like PKs) to be efficient otherwise the data will be skewed to certain segments/nodes. You will never achieve perfect segmentation so some skewing will always occur.  No need to over-engineer this.  

In the next post I'll cover how segmentation and hashing and something called "buddy projections" ensure that your data is smartly copied to other nodes to increase availability, recoverability, and performance.  

Table Partitioning

Vertica also has table partitioning.  Conceptually this is similar to SQL Server's partitioning.  Partitioning simply divides a large table into smaller pieces.  It applies to all projections for that table. Partitioning is generally used for fast data purging and query performance. It will segregate data on each node. Partitions can be dropped quickly and switched in and out just like SQL Server.  

DELETEs (and UPDATEs which are really DELETEs followed by INSERTs) are cumbersome for a columnstore like Vertica.  In a rowstore a given row is generally physically contiguous on disk.  Not so with a columnstore.  Therefore Vertica handles this by using delete vectors which are simply markers that a row(s) should be removed from any result sets.  This means that data can be as compact as possible on disk without any risk of fragmentation.  There are background "Tuple Mover" processes that actually rewrite the underlying files (projection data) to remove deletes.  Therefore, it's best to avoid DELETE processing as much as possible and this is where smart partitioning can help.  

So, other than as an alternative to deleting large chunks of data, what is the purpose of table partitioning?  Same as SQL Server.  If your query is written correctly you can benefit from "partition elimination" where Vertica does not even need to scan a given partition if it knows none of that partition's data will qualify for the result.  
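
A hypothetical sketch of both ideas follows (the table is invented, and the DROP_PARTITION call in particular should be checked against your version's documentation since the name and arguments have changed across releases):

/opt/vertica/bin/vsql -d VMart -U dbadmin <<'SQL'
-- partition a fact table by year so old years can be purged cheaply
CREATE TABLE public.web_clicks (
    click_time TIMESTAMP NOT NULL,
    user_key   INTEGER,
    url        VARCHAR(512)
)
PARTITION BY EXTRACT(YEAR FROM click_time);

-- later: drop an entire year at once instead of generating delete vectors
SELECT DROP_PARTITION('public.web_clicks', 2012);
SQL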

Partitioning and Segmentation Working Together

Segmentation is defined on the projection and is used to gain distributed computing in Vertica.  That is its sole purpose.  Different projections for the same table will have identical partitioning but can be segmented differently.  Partitioning is declared on the CREATE TABLE statement.  Segmentation is declared on the CREATE PROJECTION statement.  

Views

Vertica has the concept of a view or a saved query. A view will dynamically access and do any necessary computations from the underlying data at execution time. Unlike a projection, a view is not materialized and stores no data itself. A view does not need to be refreshed, unlike a projection, whenever underlying table data changes. A view can reference tables (and temp tables) and other views. Unlike most relational products, views in Vertica are read-only.  

Licensing and Projections and Segments

When you use the Community Edition of Vertica you only get 3 nodes and 1TB of data.  The definition of "data" is important since you can have a small amount of base table (superprojection) data but tons and tons of projections.  Data, in terms of licensing, is the base table data ONLY.  If you begin to run low on available space (in a future post I'll show you how to determine this and stay in license compliance) you can always drop partitions or delete data.  

Summary

In a future post we'll actually get to work hands-on with all of this stuff.  For now I think it's just important to understand key Vertica concepts and how they compare with SQL Server.  


You have just read "Vertica Objects" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  


Vertica Certification

I have just received news that I am now an HP ATP certified Big Data Solutions architect for Vertica.  I have been using Vertica off and on for a few years now but have really ramped up my exposure over the past year.  A few months ago I began a blog series on getting to know Vertica but I haven't had the time to continue that series.  Now that I've done the studying for the Vertica exam I hope to continue sharing that knowledge on my blog.  

 

 

Vertica on AWS

In the last post I covered the Vertica Installation Prereqs.  If you really don't want to install Vertica yourself you can always download a fully-functional VM from HP on the myVertica portal.  

Or you can run Vertica on Amazon in AWS.  I won't cover this in detail since AWS costs money and I'm frugal.  Also, Vertica is a *bit* different on AWS than when run natively.  I have a slight bit of knowledge on running Vertica on AWS.  I'll briefly document what I know in case you decide to run Vertica on AWS instead of in-house.  Running in the cloud is an excellent choice if you are an ISV and wish to package Vertica (probably the Community Edition) with your application.  

  • You'll need to be familiar with Amazon Management Console since most of the work is done from there.  
  • You specify the HP Vertica AMI you want to install using install_vertica.
  • You can still use Community Edition with 3 nodes and 1TB of data on AWS gratis.
  • You cannot create a cluster on AWS using MC (or do some other cluster-related things). Instead use the Amazon EC2 console.  

You have just read "Vertica on AWS" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  


Installing Vertica 7

In this post we are going to install Vertica 7.0.x from the command line for the first node, then look at how to install Vertica by using the Management Console (MC) in a future post, which is much easier and can be done for all subsequent nodes. The method outlined below is also the EXACT same process you follow when performing an upgrade. The only difference is prudence dictates that you would backup your existing databases PRIOR to upgrading.

Installing from the commandline

There is no apt-get for Vertica unfortunately.

  1. From your laptop (remember, Ubuntu server has no GUI) log in to the myVertica portal (accts are free) and download the HP Vertica package for the correct OS and processor architecture that you need.  
  2. I assume you are connecting to your Ubuntu box using PuTTY with SSH. If so you can use pscp to copy your downloaded .deb file to your server. You can google to download pscp if you are running stand-alone PuTTY.
  3. Open a DOS prompt (elevated) and cd to the location of pscp
  4. pscp <location of .deb file> <user>@<ip>:<path on ubuntu>
  5. You should see a progress indicator.  After the copy is complete you are ready to install the .deb package
  6. sudo dpkg -i <pathname>.  This only takes a few secs. 
  7. You now have the Vertica package installed. Now you have to install Vertica.  This is a bit confusing, just remember that installing Vertica is a two-step process.  
  8. You now have to run the install script. This can be run for a single node or it can be simultaneously installed to many cluster nodes using --hosts. In other words, you only need to install the Vertica package once, on your first node. We will install on a single node and look at the other options for our subsequent dev nodes.
  9. sudo /opt/vertica/sbin/install_vertica --hosts 192.168.0.x --deb <path to deb> -L CE --accept-eula.  The IP address can be an actual hostname as well. Don't use localhost or 127.0.0.1 else you can't add additional nodes to your cluster. This would be considered a "standalone, single node Vertica installation", and there is no known way to fix it later.  The -L CE switch instructs the installer that you are using the Community Edition.
  10. It's highly likely that you may see errors. Simply research and fix them one-by-one, re-running install_vertica as many times as it takes to resolve all issues.  Remember that you are running on Ubuntu and not Windows so expect some customizations.  Eventually you'll get so far and need to ignore certain system requirement checks in install_vertica, for instance the checks for ext3/ext4 filesystems or for LVM presence. If you have ext2 filesystems or LVM present then install_vertica will constantly fail.  It's OK to run a test Vertica instance on these filesystem configurations.  When you are comfortable that any remaining FAIL messages can be ignored, just add --failure-threshold NONE to the install_vertica command.  Never do that on a prod system, but it works for testing and playing with Vertica.
  11. At this point install_vertica should complete after a few minutes.  
  12. Logout
  13. login to Linux using the dbadmin acct that install_vertica created.  
  14. You can now play around. Try the following (or wait until my next blog post where we will do some additional Vertica fun):   /opt/vertica/bin/admintools
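
For reference, once you have worked through any failures, a complete first-node run might look roughly like this.  The IP address and .deb path are placeholders, and --failure-threshold NONE is only appropriate on a test box:

# install the package (step 6)
sudo dpkg -i /tmp/vertica_7.0.x_amd64.deb

# run the installer (steps 8-10) for a single node, Community Edition
sudo /opt/vertica/sbin/install_vertica --hosts 192.168.0.10 \
    --deb /tmp/vertica_7.0.x_amd64.deb -L CE --accept-eula \
    --failure-threshold NONE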

You have just read "nodetitle" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

