First Encounter with Massively Parallel Processing – AWS Redshift

Recently, I started working with large ad-tracking data sets (billions of records in each table). We want to run ad-hoc analytics over that data, so we need to keep every column the analytics could be performed across. I started thinking: if I have to store this data in a warehouse, what are my options?

Could I store the tables (fact and dimensions) in an old-school MSSQL data warehouse and create cubes? First, I ran a simple query, “select count(*) from…”, against one of the existing tables in our corporate data warehouse, which contains 82 million rows and 8 columns. It took 8 minutes to return the result. Second, I wondered how many cubes we would have to create, since the analytics will be performed ad hoc in several different ways. Due to the poor performance and the uncertainty in how the analytics will be performed, I decided to try a big data solution: Amazon Redshift.

Redshift is Amazon Web Services’ MPP data warehouse solution, where MPP stands for Massively Parallel Processing. Below are a few things to note about MPP:

  1. Column storage technology: Data is stored column-wise rather than row-wise. This enables high compression and in-memory operations, which give tremendous performance for aggregation workloads.
  2. RDBMS: Behind the scenes, the database still stores relational data.
  3. SQL: Standard SQL is the query language.
  4. Divide and conquer: When a SQL query is executed, it is first distributed among the nodes. The nodes compute their partial results in parallel, and a single consolidated result is returned. See the diagram below for the reference architecture of this exercise.
  5. Huge data sets: The solution is designed for very large data sets.
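To make the columnar, distributed layout concrete, here is a minimal sketch of how a table might be declared in Redshift. The table and column names are hypothetical, but DISTKEY (which decides the node a row lands on) and SORTKEY are real Redshift DDL options:

```sql
-- Hypothetical ad-tracking fact table.
-- DISTSTYLE KEY + DISTKEY spreads rows across the nodes by campaign_id,
-- so each node can scan and aggregate its own slice in parallel.
CREATE TABLE ad_events (
    event_id     BIGINT,
    campaign_id  INT,
    event_time   TIMESTAMP,
    clicks       INT,
    impressions  INT
)
DISTSTYLE KEY
DISTKEY (campaign_id)
SORTKEY (event_time);
```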

[Diagram: reference architecture of an MPP cluster — a leader node distributing a query across compute nodes]

The combination of distributed parallel processing, column storage, high compression, and in-memory operations instantly attracted me. I couldn’t resist exploring and experimenting further.

So, I provisioned a single-node Redshift cluster with the lowest available node configuration (dc1.large) through the AWS console. I created an ETL (a console application written in C#) to upload the data set for one table. The data for this table was stored in several .csv files residing in an S3 bucket. I used Redshift’s COPY command, which is one of the most efficient ways to load data into Redshift. It took approximately 2 hours to upload 250 million rows across 102 columns. That’s amazingly good performance for a single node.
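For reference, a COPY invocation looks roughly like the following. The table name, bucket path, and IAM role ARN are hypothetical placeholders, but the command shape matches Redshift’s documented S3 load syntax:

```sql
-- Load all .csv files under the given S3 prefix; Redshift splits the
-- file list across the cluster's slices and ingests them in parallel.
COPY ad_events
FROM 's3://my-bucket/ad-events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV
IGNOREHEADER 1;
```

Splitting the data into many files (ideally a multiple of the number of slices) is what lets COPY parallelize the load.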

Now that I had data in the warehouse, it was time to run a basic query, “select count(*) from…”. I used SQL Workbench to run queries. Before I could blink, the result was there: it took less than 2 seconds to generate the count. Super fast.

Next, it was time to run a query that does some aggregation. I wrote a query that grouped by a few columns and aggregated a measure column with COUNT, then ran it. The result appeared in 5m 3s. Remember, the cluster had a single node. That was fast, but I was not satisfied with the timing.
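The query had roughly this shape (the table and column names are hypothetical stand-ins for the real ad-tracking schema):

```sql
-- Group by a few columns and aggregate a measure with COUNT.
SELECT campaign_id,
       DATE_TRUNC('day', event_time) AS event_day,
       COUNT(*) AS events
FROM ad_events
GROUP BY campaign_id, DATE_TRUNC('day', event_time)
ORDER BY events DESC;
```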

Amazon claims the performance will be 2x faster if provisioned with 2 nodes. It was time to see MPP in action. I changed the number of nodes to 2 and ran the same query. Guess what: Amazon’s claim was right. I got the result in 2m 27s. Spectacular.

What if I provisioned 4 nodes? Would the performance be 4x faster? And yes, it was. The exact same query produced the exact same result in 1m 21s. Mind blowing.

AWS Advantages:

It was remarkably easy to provision nodes and use the MPP service AWS provides. From a cost point of view, we can provision more resources whenever we need them, on a pay-as-you-go basis. AWS charges $0.25 per hour per node. Here is the pricing link.

Competitors:

There are several alternatives available in the market, but in my opinion Microsoft’s Azure SQL Data Warehouse is the most similar cloud-based option in this space. You might also want to consider another emerging option, Google BigQuery.

Finally, I was amazed by the performance and by how easily and economically I could use the power of MPP through AWS. One caveat: if you mostly perform row-based operations (straight “select * from”) instead of aggregations, and you are not working with huge data sets, then Redshift is not the solution for you.

IScheduler Resolve Issue with Quartz.Net & Unity

Today I was building a job scheduler using a Windows service, Quartz.Net, and Unity. It’s a killer combination and makes for a graceful job-scheduler setup. Not to mention, it works like a charm. But, unfortunately, I ran into a painful issue while resolving Quartz’s IScheduler interface. For several hours I looked all over the internet but couldn’t find a solution. Hence, this post.

Just to give you a quick background on the application environment, I was using the following framework and packages in my service, installed from the NuGet Package Manager:

  1. .NET Framework 4.5.2
  2. Quartz.Net 2.3.3
  3. Quartz.Unity 1.4.2
  4. Unity 4.0.1

In order for dependency injection to work in scheduler applications built with Quartz, you have to resolve IScheduler through Unity. You cannot directly instantiate StdSchedulerFactory and call GetScheduler() (see here for the normal setup). I won’t go into much detail about what each line of code does, but below is the code snippet one should write.

// Register the Quartz.Unity extension so Unity knows how to build Quartz types.
var container = new UnityContainer();
container.AddNewExtension<QuartzUnityExtension>();

// Resolve the scheduler through the container instead of StdSchedulerFactory.
var scheduler = container.Resolve<IScheduler>();
scheduler.Start();

The program fails at the Resolve&lt;IScheduler&gt;() call. Briefly, it says:

Resolution of the dependency failed, type = “Quartz.ISchedulerFactory”, name = “(none)”.

The issue looks like the picture below:

[Screenshot: ResolutionFailedException for Quartz.ISchedulerFactory]

Why did this happen? What did I do wrong, and how can it be resolved?

I dug into what AddNewExtension does. I tried manually registering the scheduler factory. I tried many things. Then I went through each package’s description, version, and dependencies on the NuGet website, and looked at the different versions available for the dependent assemblies. Finally, I found the culprit assemblies:

  1. Common.Logging
  2. Common.Logging.Core

I found that the version referenced for both assemblies was 3.0.0. I updated them to the latest version (3.3.1). Bingo! The application worked.
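If you hit the same error, updating both packages from the NuGet Package Manager Console should do it (3.3.1 was the latest at the time; any newer compatible version should also work):

```
PM> Update-Package Common.Logging
PM> Update-Package Common.Logging.Core
```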

So what happened? The latest Quartz.Net package is compiled against an old version (3.0.0) of Common.Logging and Common.Logging.Core, so when I installed Quartz.Net, the old versions of these assemblies ended up in my references.
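An alternative, if you can’t or don’t want to update the packages, is an assembly binding redirect in app.config; this is the standard .NET mechanism for pointing an old compile-time reference at a newer installed assembly. A sketch, assuming the 3.3.1 version (verify the publicKeyToken against the assembly actually in your packages folder):

```xml
<configuration>
  <runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <!-- Check the token with: [Reflection.Assembly]::LoadFile(...).FullName -->
        <assemblyIdentity name="Common.Logging" publicKeyToken="af08829b84f0328e" culture="neutral" />
        <!-- Redirect Quartz.Net's 3.0.0 reference to the installed 3.3.1 -->
        <bindingRedirect oldVersion="0.0.0.0-3.3.1.0" newVersion="3.3.1.0" />
      </dependentAssembly>
    </assemblyBinding>
  </runtime>
</configuration>
```

NuGet normally generates these redirects for you on install, which is another reason updating the packages is the cleaner fix.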

Unfortunately, the exception didn’t mention anything about Common.Logging. One could argue that switching to another DI framework like Castle Windsor, or dropping Quartz from the solution, would have been easier. But you never know when you will suddenly run into such a seemingly unrelated problem.

Yet another day. I hope this post helps.

Test Driven Development With xUnit.Net

The famous saying “Prevention is better than cure” doesn’t just apply to healthcare; it applies everywhere, even to developing software.

What I mean is: while developing software, discovering scenarios and bugs early (while coding) is far better than discovering them later in the SDLC (software development life cycle). Nowadays, Test-Driven Development (TDD) and the Test-First approach are the modern way of building high-quality software in an agile environment. TDD and TF push programmers to think about all possible scenarios before and while writing code, and those cases can serve as a granular functional spec. This not only ensures quality but also gives developers a sense of ownership of the functionality they are building.

There are several tools and frameworks available to incorporate TDD in development process. This blog is dedicated towards xUnit.Net tool.

xUnit is a unit testing tool for the .NET framework, created by the original authors of NUnit. It was built to address failures and shortcomings observed in NUnit while keeping what worked well. xUnit fits closely with the .NET platform. It is free, open source, and licensed under the Apache License, Version 2.0.

There are several features that distinguish xUnit from other unit testing tools like NUnit and MSTest. Following are the main ones I would like to cover:

  1. Single object instance per test method: xUnit creates a new instance of the test class for every test method, giving complete isolation between tests and letting developers run them independently in any order.
  2. No SetUp & TearDown attributes: In NUnit you can create methods that run before and after each test method, decorated with the SetUp and TearDown attributes respectively. The downside is that this creates unnecessary confusion and forces developers to hunt around for the presence or absence of such methods. It is more productive to write the setup and teardown code within the test methods themselves (or via the constructor and IDisposable). Hence, xUnit.Net gives these unproductive attributes a pink slip.
  3. No support for the ExpectedException attribute: Another attribute to which xUnit has said “goodbye”. The reason: the expected exception could be thrown from the wrong place in the code, letting a test pass when it should have failed. Instead, developers handle expected exceptions within the test method itself (e.g. with Assert.Throws), which gives better control over arrange, act, assert and exception handling.
  4. Reduced set of attributes: To keep the framework simple, xUnit has fewer attributes than other tools. For example, unlike NUnit it doesn’t require [TestFixture] and [Test]; a test is just a method decorated with the [Fact] attribute.
  5. Automation: xUnit is well suited to test automation and works smoothly in conjunction with other testing frameworks. It is highly extensible and open for customization.
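To make points 3 and 4 concrete, here is a minimal sketch of an xUnit test class (the Calculator class is a hypothetical example): no class-level attribute, just [Fact] on the methods, and Assert.Throws in place of ExpectedException:

```
using System;
using Xunit;

// Hypothetical class under test.
public class Calculator
{
    public int Divide(int a, int b) { return a / b; }
}

public class CalculatorTests
{
    // [Fact] alone marks a test; no [TestFixture]-style attribute needed.
    [Fact]
    public void Divide_ReturnsQuotient()
    {
        var calc = new Calculator();      // arrange (fresh instance per test)
        var result = calc.Divide(10, 5);  // act
        Assert.Equal(2, result);          // assert
    }

    [Fact]
    public void Divide_ByZero_Throws()
    {
        var calc = new Calculator();
        // Assert.Throws replaces ExpectedException: the exception must
        // come from this exact call, not from anywhere else in the test.
        Assert.Throws<DivideByZeroException>(() => calc.Divide(1, 0));
    }
}
```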

In addition to the above features and differences, click here for a full comparison of xUnit vs NUnit vs MSTest.

Simple Demo: The tests can be run in several ways, using:

  • Visual Studio 2013 test explorer/runner
  • ReSharper (R#)
  • Command line
  • TestDriven.NET, and so on

I created a simple demo to demonstrate the following topics:

  1. Installation of xUnit
  2. Creation of a test using the simple [Fact] attribute and Assert
  3. The NuGet packages and extensions involved
  4. Running tests using the Visual Studio test explorer/runner and ReSharper

Lastly: quality matters. Mind it.

Download source code here