RabbitMQ - SQL Server

February 11, 2017, 2:39 am

A week or two ago I came across a post on the RabbitMQ Users forum, about how to communicate from SQL Server to RabbitMQ. Seeing that we do that a lot at Derivco, I came with some suggestions, and also said that I was writing a blog-post about how it can be done. The part of writing was not entirely correct - at least not until now (sorry Jim, it's been hectic at work).

SQL Server is awesome, and it is super easy to get data into the database. Getting data out is easy as well by querying the database. What is a bit tricker though is to get data out at the time the data is inserted or updated. Think real-time events; a purchase is made - someone needs to be notified about it at the very second it happens. We can arguable say that the data we are interested in should not be pushed out from the database, but from somewhere else. Sure, that is true - but quite often you don't have a choice.

This was the case for us - we needed to send events out of the database for some further processing, and the question was how to do it.

SQL Server and External Communication

There has been a couple of initiatives in SQL Server to allow for communication out of the database; SQL Server Notification Services (NS), which was introduced in SQL Server 2000, and more recently SQL Server Service Broker (SSB) introduced in SQL Server 2005. I covered both NS and SSB in the book I wrote together with Bob Beauchemin and Dan Sullivan - A First Look at SQL Server 2005 for Developers. NS was introduced in SQL Server 2000, as I mentioned, and had an overhaul in the beta releases for SQL Server 2005. However, NS was cut before SQL Server 2005 went RTM.

NOTE: If you read the book, you can find us covering a couple of features that never made it to RTM.

SSB survived, and in SQL Server 2008 Feature Pack, Microsoft introduced Service Broker External Activator (EA). This is a way through SSB to be able to communicate outside of the local database. The theory sounds good, but in reality, it is cumbersome and convoluted. We did some tests with it, but we quickly realized it didn't do what we wanted. Also, SSB did not give us the performance we needed, so we had to come up with something else.

SQLCLR

What we came up with was based on SQLCLR. SQLCLR is where the .NET framework is embedded in the SQL engine, and you execute .NET code in the SQL Server process. Since you execute .NET code, you can almost do anything you can in a "normal" .NET application.

NOTE: I wrote "almost" above, as there actually are certain limitations. For this discussion however, the limitations have almost no impact on what we want to do.

The way SQLCLR works is that you compile your code into a dll, and you then register the assembly with SQL Server:

Create Assembly

CREATEASSEMBLY[RabbitMQ.SqlServer]
AUTHORIZATIONrmq
FROM'F:\some_path\RabbitMQSqlClr4.dll'
WITHPERMISSION_SET=UNSAFE;
GO

Code Snippet 1:Creating an Assembly from Absolute Path

The code does the following:

CREATE ASSEMBLY - creates an assembly with a given name (whatever you want it to be).
AUTHORIZATION - indicates the owner of the assembly. In this case rmq is a pre-defined SQL Server role.
FROM - defines where the original assembly lives. The FROM statement can also take an UNC patch or a binary definition of the assembly. The installation files for this project uses the binary representation.
WITH PERMISSION_SET - sets the permissions. UNSAFE is the least restrictive, and needed in this case.

NOTE: Whatever role/login the AUTHORIZATION sets, will also cause an appdomain to be created with that name, as well as loading the assembly into that domain, when the assembly is loaded. It is good practice to try and keep your assemblies separated in different appdomains to prevent an error in one assembly taking down multiple assemblies. If the assemblies have dependencies on each other however, they cannot be separated in different appdomains.

When the assembly is created you create wrappers around the .NET methods in your assembly:

Create Wrapper Procedures

CREATEPROCEDURErmq.pr_clr_PostRabbitMsg@EndpointIDint,@Messagenvarchar(max)
AS
EXTERNALNAME[RabbitMQ.SqlServer].[RabbitMQSqlClr.RabbitMQSqlServer].[pr_clr_PostRabbitMsg];
GO

Code Snippet 2:.NET Method Wrappers

The code above does:

Create a T-SQL stored procedure named rmq.pr_clr_PostRabbitMsg, and it takes two parameters; @EndpointID and @Message.
Instead of having a body, the procedure is created against an external source, which consists of:
- An assembly named RabbitMQ.SqlServer, i.e. the assembly we created above in Code Snippet 1.
- A fully qualified type (name-space and class): RabbitMQSqlClr.RabbitMQSqlServer
- A method in the above name-space and class: pr_clr_PostRabbitMsg.

When executing the procedure rmq.pr_clr_PostRabbitMsg, the method pr_clr_PostRabbitMsg will be called.

NOTE: When creating the procedure, the assembly name is not case sensitive, however the fully qualified type and method name are. There is no requirement that the procedure name being created are the same as the method name. However, the eventual data types for the parameters have to match.

As I mentioned before, at Derivco we have the requirement to send data out of SQL server, and for that we use SQLCLR and RabbitMQ (RMQ).

RabbitMQ

RMQ is an open source message broker software that implements the Advanced Message Queuing Protocol (AMQP), and is written in the Erlang programming language. As the RMQ is a message broker, you need AMQP client libraries to connect to the broker. Your application references the client libraries, opens a connection and send messages - think ADO.NET to SQL Server. As opposed to ADO.NET, where you most likely opens a connection each time you communicate with the database, you will keep the connection open for the lifetime of the application.

So, in order to be able to communicate from the database to RabbitMQ we need an application and the .NET client library for RabbitMQ.

NOTE: In the rest of this post three will be some code snippets showing RabbitMQ code, but there won't be much explanation what they do. If you are new to RabbitMQ I suggest you have a look at the various RabbitMQ Tutorials, to get a feel for what the code is doing. The Hello World tutorial for C# is a good place to start. One thing different between the tutorials and the demo code is that in the demo code no exchanges are declared. The expectation is that they are pre-defined.

RabbitMQ.SqlServer

RabbitMQ.SqlServer is an assembly that uses the .NET client library for RabbitMQ and exposes functionality to post messages from the database to one or more RabbitMQ endpoints (VHosts and Exchanges). The code can be downloaded/forked from my GitHub repository RabbitMQ-SqlServer. It contains source code of assemblies as well as install files (so you don't have to build from source).

NOTE: This is demo code, to give an idea how SQL Server can call RabbitMQ. This is NOT production ready code in any shape or form. If the code burns down your house and kills you cat - don't blame me - it is DEMO code.

Functionality

When the assembly is being loaded, either due to en explicit call to initialize it - or implicit by calling wrapper procedure, the assembly loads the connection-string to the local database where the assembly is installed, as well as RabbitMQ endpoints, to which it also connects:

Connecting

internalboolInternalConnect()
{
try
{
connFactory=newConnectionFactory();
connFactory.Uri=connString;
connFactory.AutomaticRecoveryEnabled=true;
connFactory.TopologyRecoveryEnabled=true;
RabbitConn=connFactory.CreateConnection();
for(intx=0;x<channels;x++)
{
varch=RabbitConn.CreateModel();
rabbitChannels.Push(ch);
}
returntrue;
}
catch(Exceptionex)
{
returnfalse;
}
}

Code Snippet 3:Connection to an Endpoint

While part of connecting to the endpoint it also creates IModels on the connection, and these are used when posting messages:

Posting Message

internalboolPost(stringexchange,byte[]msg,stringtopic)
{
IModelvalue=null;
intchannelTryCount=0;
try
{
while((!rabbitChannels.TryPop(outvalue))&&channelTryCount<100)
{
channelTryCount+=1;
Thread.Sleep(50);
}
if(channelTryCount==100)
{
varerrMsg=$"Channel pool blocked when trying to post message to Exchange: {exchange}.";
thrownewApplicationException(errMsg);
}
value.BasicPublish(exchange,topic,false,null,msg);
rabbitChannels.Push(value);
returntrue;
}
catch(Exceptionex)
{
if(value!=null)
{
_rabbitChannels.Push(value);
}
throw;
}
}

Code Snippet 4:Sending a Message Using an IModel

The Post method is called via the method pr_clr_PostRabbitMsg(int endPointId, string msgToPost) which is exposed as a procedure to the CREATE PROCEDURE statement in Code Snippet 2:

Method to Call Post

publicstaticvoidpr_clr_PostRabbitMsg(intendPointId,stringmsgToPost)
{
try
{
if(endPointId==0)
{
thrownewApplicationException("EndpointId cannot be 0");
}
if(!isInitialised)
{
pr_clr_InitialiseRabbitMq();
}
varmsg=Encoding.UTF8.GetBytes(msgToPost);
if(endPointId==-1)
{
foreach(varrepinremoteEndpoints)
{
varexch=rep.Value.Exchange;
vartopic=rep.Value.RoutingKey;
foreach(varpubinrabbitPublishers.Values)
{
pub.Post(exch,msg,topic);
}
}
}
else
{
RabbitPublisherpub;
if(rabbitPublishers.TryGetValue(endPointId,outpub))
{
pub.Post(remoteEndpoints[endPointId].Exchange,msg,remoteEndpoints[endPointId].RoutingKey);
}
else
{
thrownewApplicationException($"EndpointId: {endPointId}, does not exist");
}
}
}
catch
{
throw;
}
}

Code Snippet 5:Method to be Exposed as Procedure

When executing the method the assumption is that the caller sends in the id of the endpoint to which to send the message, as well as the actual message. If -1 is sent in as endpoint id, we loop through all endpoints and send it to all. The message comes in as a string and from the string we are getting bytes by using Encoding.UTF8.GetBytes. Calling Encoding.UTF8.GetBytes should in a production environment be replaced with some type of serialization.

Installation

In the src\sql folder are all files needed for installing and running this demo code. To install:

Run the 01.create_database_and_role.sql file. This creates:
- the test database RabbitMQTest where the assembly will be created.
- a SQL Server ROLE: rmq, which will own the assembly when it is created.
- a SCHEMA, also called rmq. The various database objects are created in this schema.
Run 02.create_database_objects.sql. This script creates:
- a table rmq.tb_RabbitSetting, in which the local database connection string will be stored.
- a table rmq.tb_RabbitEndpoint, in which one or more RabbitMQ endpoints is stored.
Edit the @connString variable in 03.create_localhost_connstring.sql to the correct connection string for the RabbitMQTest database created in step 1. Then run the script.

Before continuing, you need to have an RabbitMQ broker up and running and a VHost (the default VHost / will do). We tend to have quite a few VHosts, purely for isolation purposes. The VHost also needs an exchange, in the demo code we use amq.topic. When you have a RabbitMQ broker you edit the parameters in the rmq.pr_UpsertRabbitEndpoint procedure which is in the 04.upsert_rabbit_endpoint.sql file:

RabbitMQ Endpoint

EXECrmq.pr_UpsertRabbitEndpoint@Alias='rabbitEp1',
@ServerName='RabbitServer',
@Port=5672,
@VHost='testHost',
@LoginName='rabbitAdmin',
@LoginPassword='some_secret_password',
@Exchange='amq.topic',
@RoutingKey='#',
@ConnectionChannels=5,
@IsEnabled=1

Code Snippet 6:Creating a RabbitMQ Endpoint

At this stage it is time to deploy the assemblies. What we deploy is different if the SQL Server version is pre SQL Server 2014 (2005, 2008, 2008R2, 2012), or version 2014 or later. The difference comes what version of CLR is supported. Pre SQL Server 2014, the .NET framework ran on version 2 of CLR, whereas for SQL Server 2014, and later, it is version 4.

SQL Server 2005 - 2012

Let us start with the SQL server versions that run on CLR 2, as it is not completely straight forward what to do. We know that we need to deploy the assembly we wrote, and we also need to deploy the .NET client library assembly for RabbitMQ (RabbitMQ.Client). The RabbitMQ client library should be referenced from our assembly. As we target CLR 2, our assembly as well as the RabbitMQ.Client need to be compiled against no later .NET version than 3.5. This is where there are some issues.

All the later versions of the RabbitMQ.Client libraries are compiled against CLR 4, which means they cannot be used by our assembly. The latest version of the client libraries compiled against CLR 2 is 3.4.3. However, if we try to deploy that assembly we get a "nasty" error message:

Figure 1:Missing Assembly System.ServiceModel

This version of the RabbitMQ.Client references an assembly which is not part of CLR inside SQL Server. It is a WCF assembly, and when I - above - wrote of certain limitations in SQLCR, this is one of them: that particular assembly are for all intents and purposes not allowed to run within SQL Server. The latest versions of RabbitMQ.Client does not have this reference, and can therefore be used without any issues - apart from that "pesky" CLR 4 requirement. What to do?

Well, RabbitMQ is open source, and we are developers so, let's recompile from source. Before the latest releases of RabbitMQ.Client (i.e for versions <3.5.0) I removed the System.ServiceModel reference, and re-compiled. I did have to change a couple of lines of code which used functionality from System.ServiceModel, but it was minor changes.

For this demo code I did not use the 3.4.3 version, but grabbed the 3.6.6 stable release, and recompiled using .NET 3.5 (CLR 2). That almost worked :), except for that later releases of RabbitMQ.Client uses Task's, which are not part of .NET 3.5 natively.

Fortunately there is a version of System.Threading.dll for .NET 3.5 which includes Task. I downloaded that, referenced it and all is good! The implication this has is that System.Threading.dll need to be installed as well as an assembly.

NOTE: The source for the RabbitMQ.Client from which I built the .NET 3.5 version is in my RabbitMQ Client 3.6.6 .NET 3.5 GitHub repository. The compiled dll together with the System.Threading.dll for .NET 3.5 is also in the lib\NET3.5 folder in the demo code (RabbitMQ-SqlServer) repo.

To install the necessary assemblies (System.Threading, RabbitMQ.Client, and RabbitMQ.SqlServer) run the install scripts from src\sql, in following order:

05.51.System.Threading.sql2k5-12.sql - System.Threading
05.52.RabbitMQ.Client.sql2k5-12.sql - RabbitMQ.Client
05.53.RabbitMQ.SqlServer.sql2k5-12.sql - RabbitMQ.SqlServer

SQL Server 2014+

For SQL Server 2014 and later, you compile your assembly as .NET 4.xx (the demo code is 4.5.2), and you can reference any of the later RabbitMQ.Client versions, which you can get from NuGet. For the demo code I use 4.1.1. The RabbitMQ.Client is also in the lib\NET4 folder in the demo code (RabbitMQ-SqlServer) repo.

To install you run the install scripts from src\sql, in following order:

05.141.RabbitMQ.Client.sql2k14+.sql - RabbitMQ.Client
05.142.RabbitMQ.SqlServer.sql2k14+.sql - RabbitMQ.SqlServer

SQL Method Wrappers

To create procedures that can be used from created assemblies (3.5 or 4), you run the 06.create_sqlclr_procedures.sql script which creates T-SQL procedures for the three .NET methods:

rmq.pr_clr_InitialiseRabbitMq calls pr_clr_InitialiseRabbitMq. Used to load and initialize the RabbitMQ.SqlServer assembly.
rmq.pr_clr_ReloadRabbitEndpoints calls pr_clr_ReloadRabbitEndpoints. Loads the various RabbitMQ endpoints.
rmq.pr_clr_PostRabbitMsg calls pr_clr_PostRabbitMsg. Used to post a message to RabbitMQ.

The script also creates a "pure" T-SQL procedure - rmq.pr_PostRabbitMsg - which executes against the rmq.pr_clr_PostRabbitMsg. This procedure is a "wrapper", which knows what to do with the data and handles errors etc. In production we have multiple procedures like this, handling different types of messages. More about that in Usage below.

Usage

From all of the above we can see that to post a message to RabbitMQ we call the rmq.pr_PostRabbitMsg or rmq.pr_clr_PostRabbitMsg with an endpoint id and the message in string format. That is cool, but how would it be used in the "real world".

What we do in the production systems - in the stored procedures that processes data and the data should be sent to RabbitMQ - is to build up the data we want to send and then in a "hook point" we call the equivalent of rmq.pr_PostRabbitMsg. Below is a very simplified example of such a procedure:

Processing Procedure

ALTERPROCEDUREdbo.pr_SomeProcessingStuff@idint
AS
BEGIN
SETNOCOUNTON;
BEGINTRY
--create a variable for the endpoint
DECLARE@endPointIdint;
--create a variable for the message
DECLARE@msgnvarchar(max)='{'
--do important stuff, and collect data for the message
SET@msg=@msg+'"Id":'+CAST(@idASvarchar(10))+','
-- do some more stuff
SET@msg=@msg+'"FName":"Hello",';
SET@msg=@msg+'"LName":"World"';
SET@msg=@msg+'}';
--do more stuff
-- get the endpoint id from somewhere, based on something
SELECT@endPointId=1;
--here is the hook-point
--call the procedure to send the message
EXECrmq.pr_PostRabbitMsg@Message=@msg,@EndpointID=@endPointId;
ENDTRY
BEGINCATCH
DECLARE@errMsgnvarchar(max);
DECLARE@errLineint;
SELECT@errMsg=ERROR_MESSAGE(),@errLine=ERROR_LINE();
RAISERROR('Error: %s at line: %d',16,-1,@errMsg,@errLine);
ENDCATCH
END

Code Snippet 7:Procedure with Hook-point

In Code Snippet 7 we see how the data to be sent is captured in the procedure, and when processing is done the data is sent. To use this procedure execute the 07.create_processing_procedure.sql script in the src\sql folder.

Make it All Happen

At this stage you should now be ready to send some messages. Before you test it, make sure you have a RabbitMQ queue bound to the exchange your end-point in rmq.tb_RabbitEndpoint points to.

So, to test this:

Open the script file 99.test_send_message.sql.
Execute EXEC rmq.pr_clr_InitialiseRabbitMq;, to initialize the assembly and load the RabbitMQ endpoints. This is not necessarily required, but it is good practice to "pre-load" the assembly after it has been created or altered.
Execute EXEC dbo.pr_SomeProcessingStuff @id = 101 (use any id value you want).

If no errors happened you should now see a message in your bound queue in the RabbitMQ broker! Congratulations, you have now used SQLCLR to send a message to a RabbitMQ broker.

If you have comments, questions etc., please comment on this post or ping me.

↧

Interesting Stuff - Week 6

February 11, 2017, 8:32 pm

≫ Next: Interesting Stuff - Week 7

≪ Previous: RabbitMQ - SQL Server

Throughout the week, I read a lot of blog-posts, articles, etc., that has to do with things that interest me:

data science
data in general
distributed computing
SQL Server
transactions (both db as well as non db)
and other "stuff"

This is the "roundup" of the posts that have been most interesting to me, this week.

This week has been somewhat hectic work-wise, so I have not read as much as I have wanted, but this is what I found.

Streaming

Apache Flink Community Announces 1.2.0 Release. Flink is a high-performing stream processing framework. They have now released version 1.2, which adds really exciting new functionality to the engine.
Hazelcast Release Jet, Open-Source Stream Processing Engine. While we are on the subject of stream processing engines, this article is about a new stream processing engine that has some new innovative thinking in the way it works.

Distributed Computing

Virtual Panel: Microservices in Practice. Panel discussion about the state of art of Microservices, and how they are likely to evolve.

SQL Server

Extreme 25x compression of JSON data using CLUSTERED COLUMNSTORE INDEXES. In last weeks roundup, I pointed out a post by Jovan Popovic about JSON data and Clustered Column Store Indexes. This weeks post drills further into it and shows how you can get really impressive compression of the data.
Exporting tables from SQL Server in json line-delimited format using BCP.exe. More by Jovan. This time how SQL Server can be used to export content of tables into line-delimited JSON format.
SQL Server First Responder Kit. Through an article in InfoQ, I came across this very handy tool for anyone that has to do any kind of work with SQL Server.

Data Science

Build an intelligent app with SQL Server and R. By now it should be pretty clear that SQL Server 2016, has some very impressive capabilities when it comes to Data Science. This post outlines how to get started and building a predictive model, using SQL Server 2016 and R.
Retail customer analytics with SQL Server R Services. More about SQL Server and R. This time about analytics of retail customers.
Machine Learning Your Way to Smarter API Error Responses. Presentation about how Machine Learning can be used to help you understand malformed API requests and to be able to respond with a best fit response, as well as capturing the user errors for future responses.
Machine Learning and End-to-End Data Analysis Processes in Spark Using Python and R. Presentation by Debraj GuhaThakurta from the Microsoft Azure Machine Learning group, where he talks about machine learning and data analysis processes in Spark using Python and R.

Big Data and Data Lakes

Load Data from Azure Data Lake into Azure SQL Data Warehouse at 3TB/Hour. Post about how to use SQL Server Data Warehouse PolyBase support to load data from Azure Data Lake Storage into SQL Server Data Warehouse.

Shameless Self Promotion

So this is my shameless self promotion part, where I point out posts I have written etc.

RabbitMQ - SQL Server. Post about how to send data from SQL Server to RabbitMQ.
satRday - Cape Town. This is the second satRday conference ever - worldwide! My talk is about Microsoft R Server, and how it compares to CRAN R. I do believe there are still available seats, so come by and say Hi!

That's all for this week. I hope you enjoy what I did put together. If you have ideas for what to cover, please comment on this post or ping me.

↧

Interesting Stuff - Week 7

February 19, 2017, 8:48 am

≫ Next: Microsoft R Server

≪ Previous: Interesting Stuff - Week 6

Throughout the week, I read a lot of blog-posts, articles, etc., that has to do with things that interest me

data science
data in general
distributed computing
SQL Server
transactions (both db as well as non db)
and other "stuff"

This is the "roundup" of the posts that has been most interesting to me, this week.

This post is little late as I was in Cape Town during the weekend and gave a talk at satRday. The conference was really good, great job by Andrew Collier for arranging this. During the week I'll put the code for my talk up on GitHub.

Data Science

6 Deep Learning Applications a beginner can build in minutes (using Python). Interesting article trying to de-mystify Deep Learning.
Deep Learning in R. More about Deep Learning. This talks about various R packages for Deep Learning.
Real-World, Man-Machine Algorithms. This article, which is part of the InfoQ series An Introduction To Machine Learning, talks about the end-to-end flow of developing machine learning models: where to get training data, how to pick the ML algorithm, and so forth.
Performance improvements coming to R 3.4.0. Talks about what can be expected in the new R release 3.4, scheduled for March.
RedQueen: An online algorithm for smart broadcasting in social networks. From the morning paper. This is about algorithms can be used to find the optimal time to post on social networks.

Streaming

Spark is the Future of Analytics. Interesting analysis of Spark.
Kafka Streams - how does it fit the stream processing landscape?. Post about Kafka Streams, a library for transforming and combining data streams in Kafka.
Towards a real-time streaming architecture. How Sky Betting & Gaming went with Kafka for real-time streaming.
User Activity Events using Kafka Streams. More about Kafka and Kafka streams. How to enrich an incoming stream of events with side data, and then compute aggregations based on the enriched stream.

Big Data & Databases

Petabytes Scale Analytics Infrastructure @Netflix. About Netflix' overall big data platform architecture, focusing on Storage and Orchestration.
Spanner, the Google Database That Mastered Time, Is Now Open to Everyone. About Google Spanner, a database that can span multiple geo-locations and still be seen as one instance.

That's all for this week. I hope you enjoy what I did put together. If you have ideas for what to cover, please comment on this post or ping me.

↧

Microsoft R Server

February 24, 2017, 8:28 pm

≫ Next: Interesting Stuff - Week 8

≪ Previous: Interesting Stuff - Week 7

Last Saturday (February 18) I was in Cape Town at the second world-wide satRday conference ever, where I gave a talk named: "Microsoft, Open Source, R: You gotta be kidding me!". The talk was about Microsoft's R Server offering, and how it in some cases offered a better performance than Open R. Seeing the session lengths were only 20 minutes, I could not show any code, so in a "weak" moment I promised to put the code up on my web-site together with an accompanying blog-post. This is it :).

Demo Code - Scenario, Installation and Data

This blog-post has some accompanying demo code, which we will use in order to compare R with Microsoft R Server. The code can be downloaded from here.

Scenario

We are using simulated mortgage data for a 10 year period where the data contains information about mortgages and if they were defaulted or not. All in all it is 10 million records, originally stored in .csv files, but the installation process inserts the data into a SQL Server table.

What we want to do is to retrieve the data from the database, and then create a model based on the data.

Installation

Unzip the downloaded microsoft_r_server.zip file to some location on your hard-drive.
Follow the install instructions in the index.html file.
Please remember to close the @path variable with a "\".

At this point you should now have a database, MortgageDb, with a table, dbo.tb_MortgageData containing 10 million records.

Data

To see what the data looks like, execute: SELECT TOP(5) * FROM dbo.tb_MortgageData, and you'll see something like so:

Figure 1:Mortgage Data

The data contains 6 variables (columns):

CreditScore - credit rating for the holder of the mortgage
HouseAge - how old the house is (years)
YearsEmp - number of years the mortgage holder has been employed at their current job
CreditCardDebt - how much debt the mortgage holder has on his (or hers) credit card
Year - the year the data was collected
DidDefault - binary variable indicating whether the mortgage holder defaulted (0 - no, 1 - yes)

The variables above are what we will use to create a model.

R

So onto R. R is awesome! It is no doubt about it, and it has become a defacto standard for advanced analytics. Figure 2 below is from IEEE Spectrum’s third interactive ranking of the most popular programming languages:

Figure 2:Language Popularity

During the last few years R has steadily climbed and is now in 5:th place, pushing C# down to 6:th! A lot of R's popularity can probably be attributed to its packages, where there seems to be packages for anything you want to do, plus some more. At the moment (late February 2017) the CRAN package repository features ~10,120 available packages (it increases by the day). To put the number of packages in perspective; in March 2016, there were ~8,000 packages available.

Issues

As great as R is, there are some shortcomings:

Data movement
Operationalization
Scale / performance

Data Movement

When you use R, you have to move data from the source to R (most likely your machine). Moving large data volumes over the network may not be ideal, and the security department may not be too happy either.

Operationalization

The data scientist has created the best model - ever! How do you now put this into production; do you have the data science guy retrieve data to predict upon to his (or hers) machine and run the model against live data, or what do you do?

Scale / Performance

A problem with R is that it is single threaded. Furthermore when working with data, all data has to be in memory. In todays world when we more and more want to analyze big data, this can become an issue.

Demo Code R

To showcase some issues with Open R, let us to create a model against our data in dbo.tb_MortgageData. So in our favorite R editor we probably write some code like:

Get Mortgage Data

# load in the ODBC library for R
library(RODBC)
# set up a connection
conn<-odbcDriverConnect(connection="Driver={SQL Server native Client 11.0};server=server_name;database=MortgageDb;uid=user_id;pwd=secret_password")
# read the data into a dataframe - mydata - this will take a while
mydata<-sqlQuery(conn,"SELECT CreditScore, HouseAge, YearsEmp, CreditCardDebt, Year, DidDefault FROM dbo.tb_MortgageData")
# treat HouseAge and Year as a categorical variable
mydata$HouseAge<-factor(mydata$HouseAge)
mydata$Year<-factor(mydata$Year)

Code Snippet 1:Load Mortgage Data into R

From Code Snippet 1 above you can see how we load all the 10 million rows of data into a data-frame. If you run this code yourself, you will notice it will take a while - but all 10 millions row will eventually be in memory.

After having read the data, we want to treat Year and HouseAge as a categorical/factor variables, so we use the factor function for that. At this stage we are now ready to create a model.

We do believe that a logistic regression model would be useful, where DidDefault is the dependent variable with CreditScore, YearsEmp, CreditCardDebt and the factor:ized HouseAge and Year as independent variables:

Logistic Regression

# this comes after the factoring of HouseAge and Year
logit<-glm(DidDefault~HouseAge+Year+CreditScore+
YearsEmp+CreditCardDebt,
data=mydata,family="binomial")

Code Snippet 2:Logistic Regression with glm()

Before you execute the above, have a look at task manager for how much memory you are consuming. On my development PC it looks like this:

Figure 3:Memory After Loaded Data

The memory consumption is very small, 25% overall. Let us see what happens when we execute the logistic regression as in Code Snippet 2.

NOTE: If you run this yourself, have a very close eye at the memory consumption, and be prepared to kill RStudio, when memory reaches 98 - 99%.

After a while, the logistic regression is still running and the memory is like below:

Figure 4:Memory During Logistic Regression

My development PC has 24 Gb of Ram and a couple of times when I have tested this, the PC has blue-screened, due to running out of memory. Other times the regression has, run and run and run - and I have finally killed the R Studio session after 10 - 15 minutes.

We have just seen an example where some of R's limitations are causing problems. In an enterprise scenario the above may cause issues, especially when we - in the enterprise - more and more are analyzing Big Data. In the example above it was 10 million rows of data, not really Big Data - but what do we do in these scenarios? Well, there are enterprise software vendors who have their own offering of enterprise R (obviously for a price), among them are Oracle, Tibco and up until early 2015 Revolution Analytics.

Revolution Analytics

Revolution Analytics is a statistical software company focusing on developing big data, large scale multiprocessor computing, and multi-core functionality version of R: Revolution R Enterprise. Both Teradata and IBM partnered with Revolution Analytics to provide analytical platforms for the enterprise.

In January 2015 Microsoft purchased Revolution Analytics and re-branded Revolution R Enterprise as Microsoft R Server.

Microsoft R Server

Microsoft R Server is next generation of Revolution R Enterprise server, and offers an enterprise class server for hosting and managing parallel and distributed workloads of R processes on servers (Linux and Windows) and clusters (Hadoop and Apache Spark). It extends open source R with support for high-performance analytics, statistical analysis, machine learning scenarios, and massively large datasets.

As mentioned above Microsoft R Server can run on both Windows as well as Linux, and in the Windows world, SQL Server is the delivery mechanism for Microsoft R Server.

Some of the key components of Microsoft R Server are:

DeployR - An integration technology for deploying R analytics inside web, desktop, mobile, and dashboard applications as well as backend systems.
ConnectR - High speed connectors to any data source ranging from simple workstation file systems to complex distributed file systems and MPP databases.
DistributedR - An adaptable parallel execution framework that includes services for communications, storage integration and memory management.
R Tools for Visual Studio - Turns Visual Studio into a powerful R development environment, including things like Intellisense!
ScaleR - Provides algorithms optimized for parallel execution on big data. These algorithms are optimized for transparent distributed execution, eliminate memory limitations and scale from laptops to servers to large clustered systems. Foundation for RevoScaleR.

RevoScaleR

RevoScaleR, is an R package providing both High Performance Computing (HPC) and High Performance Analytics (HPA) capabilities for R. HPC capabilities allow you to distribute the execution of essentially any R function across cores and nodes, and deliver the results back to the user. HPA adds big data to the challenge.

The data manipulation and analysis functions in RevoScaleR are appropriate for small and large datasets, but are particularly useful in three common situations:

To analyze data sets that are too big to fit in memory.
To perform computations distributed over several cores, processors, or nodes in a cluster,
To create scalable data analysis routines that can be developed locally with smaller data sets, then deployed to larger data and/or a cluster of computers.

In the demo code that follows we'll see how RevoScaleR is doing with our mortgage data from above.

Demo Code RevoScaleR

It is worth noting that in the RevoScaleR code, the editor I used is Visual Studio and R Tools for Visual Studio. To use Microsoft R Server and RevoScaleR you do not need to use Visual Studio, you can use any editor you want. Just make sure the editor uses the Microsoft R Server engine.

NOTE: To change engine in RStudio, go to Tools | Global Options, and under the R General tab, you change the R version as in Figure 5 below.

Figure 5:Changing R Version in RStudio

If you want to follow along in the code, and you are using Visual Studio (with R Tools for Visual Studio), you can open the solution file in the VS\VSMortgage folder from the unzipped file above. If you use RStudio or some other editor then just open the script.R file from the VS\VSMortgage folder, in your preferred editor. Once again, just make sure that your R engine is Microsoft R Server.

So, what does the code look like:

RevoScaleR Code

# set up a connection string
sqlServerConnString<-"Driver=SQL Server;server=server_name;database=MortgageDb;uid=user_id;pwd=secret_password"
# generate a data frame - notice the data won't be read into the frame until it is needed
mydata<-RxSqlServerData(sqlQuery="SELECT CreditScore, HouseAge, YearsEmp, CreditCardDebt, Year, DidDefault FROM dbo.tb_MortgageData",
connectionString=sqlServerConnString,
rowsPerRead=1000000)
# create a histogram
rxHistogram(~CreditScore,data=mydata);
# get some info about the data
rxGetInfo(mydata,numRows=5);

Code Snippet 3:Using RevoScaleR Package

The code does not differ that much from the original code. We start with defining a connection string to the database. Then we create a data frame using the RxSqlServerData function. A difference from using sqlQuery in the previous demo is that the data won't be read into the data frame until it is needed.

We then go on to create a histogram by using the rxHistogram function. You will find that most RevoScaleR specific functions are named with a starting rx.

NOTE: Microsoft R Server also contains the CRAN R packages you know and love (at least most of them).

When we have our histogram we decide we need some information of the data so we call rxGetInfo, which is more or less the equivalent of CRAN R summary().

Having come this far, it is time to create a model:

Logistic Regression

# do the logistic regression
system.time(
logit<-rxLogit(DidDefault~F(HouseAge)+F(Year)+CreditScore+YearsEmp+CreditCardDebt,
data=mydata,blocksPerRead=2,reportProgress=1))

Code Snippet 4:Logistic Regression using rxLogit

Instead of using glm() we use the specialized rxLogit function which is optimized for performance. We factorize HouseAge and Year by using the F function. So, what happens now when we execute it? If you run this yourself, please keep a close eye on the memory consumption in Task Manager.

Nothing much seems to happen with memory:

Figure 6:Memory Consumption using rxLogit

The memory more or less stayed the same during execution, and after 155 seconds (or so) we were done!

So we have seen how Microsoft R Server can help us when analyzing large data sets.

R and SQL Server

I mentioned above how SQL Server were the delivery mechanism of Microsoft R Server on Windows. SQL Server is not only that, but it also has R embedded, so you can in SQL Server execute R code - somewhat like extended stored procedures. In a future blog-post (or posts), I'll look at how R in SQL Server works, and what you can do with it.

Summary

At the beginning of this post I mentioned how CRAN R have some issues. Through Microsoft R Server some of these issues can be addressed:

Data movement - execute on SQL Server
Operationalization - execute your R code by using SQL Server stored procedures (once again, more about this in another post)
Scale / performance - RevoScaleR is offers both High Performance Computing (HPC) and High Performance Analytics (HPA).

If you have comments, questions etc., please comment on this post or ping me.

↧

Interesting Stuff - Week 8

February 25, 2017, 9:11 pm

≫ Next: Microsoft SQL Server 2016 R Services Installation

≪ Previous: Microsoft R Server

Throughout the week, I read a lot of blog-posts, articles, etc., that has to do with things that interest me

data science
data in general
distributed computing
SQL Server
transactions (both db as well as non db)
and other "stuff"

This is the "roundup" of the posts that has been most interesting to me, this week.

SQL Server

Choosing a Primary Key. Another post in Sean Cremer's series about database design. Full disclosure, he is a colleague of mine - but still a very good guy :). His SQL knowledge is immense!
SQL Server 2016 Developer Edition in Windows Containers. Announcement and introduction to the availability of SQL Server 2016 Developer Edition in Windows containers. This is a "biggie" for me who often want to spin up a new SQL Server instance. Now I can just have a container with the instance on it and spin it up!
SQLskills SQL101: Stored Procedures. First in a series of "back to the basics" by Kimberly and Paul. This covers my absolutely favorite feature in SQL Server: Stored Procedures! All you who are saying they are no good - I have a word for you: Heathens! :)
Architecting SQL Server on Linux: Slava Oks on Drawbridge, LibOS, & Addressing Between Windows/Linux. A podcast with Slava Oks, where Slava talks the implementation of SQL Server on Linux.

Streaming

Beam Graduates to Top-Level Apache Project. Beam is an Apache project seeking to create a unified programming model for streaming and batch processing jobs, and to produce artifacts that can be consumed by a number of supported data processing engines.
Fundamentals of Stream Processing with Apache Beam. More about Beam. This is a presentation about Beam's out-of-order stream processing as well as Beam tries to simplify complex tasks.
Kafka Summit New York. If you are doing streaming, then you most likely are interested in or, at least, have heard about Kafka. The yearly Kafka conference are coming up, so go ahead and register.

Data Science

Data Science in the Cloud @StitchFix. A conference presentation about how the cloud enables over 80 data scientists to be productive at StichFix.
Elastic Data Analytics Platform @Datadog. Conference presentation about DataDog's cloud-based analytics platform and how it differs from a traditional datacenter-based analytics stack.
R Tools for Visual Studio. R Tools for Visual Studio are getting closer and closer to a version 1.0 release.
Prophet - Forecasting at Scale. Prophet is an open-source package for R and Python that implements the time-series methodology that Facebook uses in production for forecasting at scale. Looks very, very interesting.
Microsoft R Server. I have to do some shameless self-promotion :). This is a blogpost by me comparing how CRAN R handles large datasets compared to Microsoft R Server.

That's all for this week. I hope you enjoy what I did put together. If you have ideas for what to cover, please comment on this post or ping me.

↧

Microsoft SQL Server 2016 R Services Installation

March 4, 2017, 12:19 am

≫ Next: Interesting Stuff - Week 9

≪ Previous: Interesting Stuff - Week 8

A couple of weeks beck I wrote a blog-post where I compared CRAN R with Microsoft R Server. The comparison was basically how large data sets are handled. In the post I mentioned that SQL Server 2016 (and later) is the delivery mechanism for Microsoft R Server on the Windows platform. In this post we will look at how to install and enable SQL Server R Services.

In the "Microsoft R Server" post I wrote that CRAN R, as good as it is, in certain scenarios it is less then ideal:

Data movement - the data you work with has to be moved from source to your machine.
Operationalization - having created an awesome model, how do you put it in production.
Scale / performance - R is single threaded and all data has to be in memory.

SQL Server R Services addresses these issues, which we will see later. For now let us see what SQL Server R Services is.

Introduction

Back in 2015 Microsoft bought Revolution Analytics. Revolution Analytics is a company who developed a big data, large scale multiprocessor computing, and multi-core functionality version of R: Revolution R Enterprise, which Microsoft re-branded to Microsoft R Server. At the heart of Microsoft R Server is RevoscaleR; an R package providing both High Performance Computing (HPC) and High Performance Analytics (HPA) capabilities for R. HPC allows distribution of the execution of essentially any R function across cores and nodes, and deliver the results back to the user, whereas HPA adds the ability to handle very large datasets (Big Data).

SQL Server R Services is the conduit between SQL Server and Microsoft R Server, and allows the execution of R code from inside SQL Server. Hmm, execution of R code from inside SQL Server, let's think about the implications of being able to do that, and put that in relation to the CRAN R issues above:

A data scientist can theoretically create his model(s) from inside the database. If the source of the data is the database, there are no movement of data.
The model(s) can be stored in the database and subsequently be used to analyze/predict new data. For example, a stored procedure can be created to output predictions. All of a sudden the issue of operationalization has been solved.
SQL Server is multi-threaded and supports parallelism, which should take care of the scale and performance issues in R. Furthermore the various RevoScaleR functions are optimized for multi-threading as well as large data volumes.

NOTE: If you are executing R code from an IDE (and not from inside SQL Server), you can still have your code execute on SQL Server by using an RevoScaleR feature; the compute context. The compute context is not "in scope" for this blog-post, but you can read more about it here.

Installation

The SQL Server R Services has to be explicitly installed in SQL Server, it is not like SQLCLR - which just needs to be enabled. Installing SQL Server R Service can be done when installing a new SQL Server instance or it can be added afterwards as a new feature. Here we'll see how we install the R Services together with a new instance of SQL Server:

Figure 1:SQL Server Installation Type

In Figure 1 we are choosing to do a new installation of SQL Server. We then choose what features we want installed:

Figure 2:Features to Install

When you install SQL Server 2016, you find a new feature under the Database Engine Services: R Services (In-Database), as in Figure 2. This is the install option when you want to integrate with R from within SQL Server. In the Feature Selection dialog, under Shared Features you also have R Server (Standalone), what is that all about:

Figure 3:Windows R Server

SQL Server 2016 acts as the delivery mechanism for Microsoft R Server on Windows, and the R Server feature allows you to install Microsoft R Server independent of SQL Server. For example, if a data scientist want to run R on his or her own machine.

NOTE: You may wonder why, in Figure 3, the check box for R Server is checked, but "dimmed" out? This is because, on this particular machine, I have already installed the standalone R Server once.

Go ahead and install after you have chosen the various features to install on your new instance. After successful installation we should check and make sure everything works, but before that let's talk about a couple of very important pieces of the SQL Server R Services "puzzle".

SQL Server Launchpad

Let us start with the launchpad service. After installation, if you go to Services, you will see your newly installed SQL Server instance SQL Server (InstanceName), but also a new service which hasn't been in earlier versions of SQL Server; the Launchpad service. You can see it in Figure 4 below:

Figure 4:Windows R Server

In the Introduction section I wrote how SQL Server R Services is the conduit between SQL Server and Microsoft R Server, and the launchpad service plays a big part of this. The service acts as a "routing" mechanism between SQL Server and external languages/runtimes. The responsibility of the launchpad service is to "spin up" and calling into the correct runtime when calling from inside SQL Server.

NOTE: Right now, the only "runtime"/language supported by the launchpad service is R.

Another piece of the "puzzle" is the the extended stored procedure: sp_execute_external_script

sp_execute_external_script

This is the procedure that will be used when executing R code inside SQL Server. The procedure will call into the launchpad service, and have the launchpad service "route" the call to the correct runtime.

The syntax for the procedure as copied from MSDN, looks like so:

sp_execute_external_script

sp_execute_external_script
@language=N'language',
@script=N'script',
@input_data_1=]'input_data_1'
[,@input_data_1_name=]N'input_data_1_name']
[,@output_data_1_name='output_data_1_name']
[,@parallel=0|1]
[,@params=]N'@parameter_name data_type [ OUT | OUTPUT ] [ ,...n ]'
[,@parameter1=]'value1'[OUT|OUTPUT][,...n]
[WITH<execute_option>]
[;]
<execute_option>::=
{
{RESULTSETSUNDEFINED}
|{RESULTSETSNONE}
|{RESULTSETS(<result_sets_definition>)}
}
<result_sets_definition>::=
{
(
{column_name
data_type
[COLLATEcollation_name]
[NULL|NOTNULL]}
[,...n]
)
|ASOBJECT
[db_name.[schema_name].|schema_name.]
{table_name|view_name|table_valued_function_name}
|ASTYPE[schema_name.]table_type_name
}

Code Snippet 1:Syntax of sp_execute_external_script

If you look at the syntax and thinks it looks convoluted, you are right. But do not worry, in a future blog-post we will look more in detail at what sp_execute_external_script does, and further down in this post we will see a very simple example of it. One thing to notice though is the first parameter in the proc: @language. This parameter defines the language/runtime, and as mentioned a couple of times before, right now the only language/runtime supported is R. There are rumors that Python will be supported in the future, as well as Julia.

Making Sure it Works

Now when we have installed the SQL Server R Services, and also have had a (very) short introduction to some of the "moving parts", let's enable SQL Server R Services, and execute something to actually see it works.

Enable SQL Server R Services

Enable, what do you mean enable? I have just installed it, isn't that enough? No it is not, after installation you need to enable the services. This is a bit like what you do with SQLCLR, you are enabling the execution of SQLCLR on the instance. Here you enable the execution of external scripts. To enable external scripts you change the configuration of the SQL Server instance you are on:

Enable External Scripts

EXECsp_configure'external scripts enabled',1
RECONFIGUREWITHOVERRIDE

Code Snippet 2:Execute sp_configure

When you have executed the code as in Code Snippet 2, you might think all is good. Unfortunately it is not, you now have to restart the instance of SQL Server! However, after the restart you can test and see that it works.

Executing R Script

To test it out we will execute a very simple R script:

Execution of R Script

EXECsp_execute_external_script@language=N'R',
@script=N'OutputDataSet<-InputDataSet',
@input_data_1=N'SELECT 42'
WITHRESULTSETS(([TheAnswer]intnotnull));
GO

Code Snippet 3:Test That R Installation Works

Let's look at the various parts of the code in Code Snippet 3:

We start with the @language parameter. We now know that it should be R.
Then we define the @script parameter. This is where our R code will go. In the code we are saying that output data (OutputDataSet) is whatever coming in (InputDataSet).
In @inout_data_1 we define the input dataset
We finally "formats" the output in the WITH RESULT SETS ... part.

When you execute the code, the output should be like so:

Figure 5:Result of Execution of R Script

Congratulations, you have now executed R code inside SQL Server. How cool is that?!

In future blog-posts we will look at:

How things work under the cover
What exactly is sp_execute_external_script
Some real world examples

If you have comments, questions etc., please comment on this post or ping me.

↧

Interesting Stuff - Week 9

March 4, 2017, 8:58 pm

≫ Next: Interesting Stuff - Week 10

≪ Previous: Microsoft SQL Server 2016 R Services Installation

Throughout the week, I read a lot of blog-posts, articles, etc., that has to do with things that interest me:

data science
data in general
distributed computing
SQL Server
transactions (both db as well as non db)
and other "stuff"

This is the "roundup" of the posts that has been most interesting to me, this week.

Hmm, this week there has not been much catching my eye. What has been is mostly data science / analytics related. Anyway, here we go.

Data Science

LearnAnalytics@MS. A Microsoft website which contains a lot of really cool resources if you want to learn about data science and analytics.
Backyard Data Science. I came across this site from a link from the site above. This site by Buck Woody from Microsoft contains a lot of absolutely awesome articles related to data science.
The InfoQ eMag: Getting a Handle on Data Science. From [InfoQ] an eMag looking at data science from the ground up, across technology selection, assembling raw and unstructured data, statistical thinking, machine learning basics, etc., etc.
Awesome - Most Cited Deep Learning Papers. A curated list of the most cited deep learning papers since 2012. A lot of really interesting papers in there.
Machine Learning is Fun!. A multi part guide to machine learning. It provides you with a high-level explanation of what machine learning is, and doing it in an easily understandable manner.
Optimisation and training techniques for deep learning. Adrian Colyer from the morning paper has waded through the papers mentioned above pertaining to "optimisation and training techniques". I believe I have mentioned it before, but I'll do it again: the morning paper is an excellent source for data science related "stuff".
Microsoft SQL Server 2016 R Services Installation. A post from yours truly. The post is the first in a series of posts about Microsoft SQL Server R Services. In this post I cover installation and enabling of SQL Server R Services.

Big Data & Databases

Google Launches Cloud Spanner Public Beta. In my weekly roundup for week 7 I had a link to an article of Spanner, the Google database that can span multiple geo-locations and still be seen as one instance. This article from InfoQ talks about how Spanner has gone into public beta.

That's all for this week. I hope you enjoy what I did put together. If you have ideas for what to cover, please comment on this post or ping me.

↧

Interesting Stuff - Week 10

March 11, 2017, 9:39 pm

≫ Next: Microsoft SQL Server R Services - Internals I

≪ Previous: Interesting Stuff - Week 9

Throughout the week, I read a lot of blog-posts, articles, etc., that has to do with things that interest me

data science
data in general
distributed computing
SQL Server
transactions (both db as well as non db)
and other "stuff"

This is the "roundup" of the posts that has been most interesting to me, this week.

Data Science

Deep Learning for Sensor Fusion and Sequence Classification. Faisal discusses how the Microsoft Cognitive Toolkit can be used for sequence classification.
Data Preprocessing vs. Data Wrangling in Machine Learning Projects. Article from InfoQ, which compares different alternative techniques to prepare data for machine learning. Techniques include extract-transform-load (ETL) batch processing, streaming ingestion and data wrangling.
TensorFlow 1.0 Released. Another article from InfoQ. This article is about the release of Google's TensorFlow.
rxNeuralNet vs. xgBoost vs. H2O. In version 9.0.3 of Microsoft R Server, Microsoft has introduced a new package for Microsoft R Server; MicrosoftML. The package brings new machine learning functionality with improvements in speed, performance and scalability. In Tomaz blog-post he puts the new functionality to test.
Microsoft Data Science Newsletter. If you are interested in what Microsoft is doing in data science you should definitely subscribe to the monthly newsletter.
Employee Retention with R Based Data Science Accelerator. Cool "stuff from" Revolution Analytics about how to use R to analyze employee retention.
Announcing R Tools for Visual Studio. R Tools for Visual Studio has been released!

Distributed Computing

Redundancy does not imply fault tolerance: analysis of distributed storage reactions to single errors and corruptions. Adrian dissects a white-paper pertaining to how distributed storage reacts to errors and corruptions. My conclusion; be afraid, be very afraid!

SQL Server

Why PFS pages cannot be repaired. Paul Randal from SQLskills fame explains why DBCC CHECKDB cannot repair Page Free Space pages. Very cool "stuff"!
SQL Server on Linux, will it perform?. This is the slide deck from Slava Oks presentation at QCon in London this year about SQL Server on Linux. Amazing! Cannot wait for the video to be published!
Context in perspective 1: What the CPU sees in you. Ewald has a series about context in SQL Server, and this is the first post. So, so interesting! As a side note; you should really follow Ewald's blog if you are interested in various and sundry, deeply technical topics of how SQL Server works under the covers! WinDbg FTW!

That's all for this week. I hope you enjoy what I did put together. If you have ideas for what to cover, please comment on this post or ping me.

↧

Microsoft SQL Server R Services - Internals I

March 18, 2017, 7:21 am

≫ Next: Interesting Stuff - Week 11

≪ Previous: Interesting Stuff - Week 10

This post is part of a series of blog-posts about Microsoft SQL Server R Services:

Microsoft SQL Server 2016 R Services Installation
Microsoft SQL Server R Services - Internals I (this post)
More to come (hopefully)

In this post, and one or two more we will look at what goes on under the covers when we execute some R code inside SQL Server. This post looks at quite in detail what happens in the SQL engine when we execute sp_execute_external_script.

To try and get an understanding we'll do something that we did quite a lot back in the day when I worked at Developmentor; we'll "spelunk" the SQL Server code via WinDbg. This can be really useful when trying to understand and get to grips with new technology/functionality.

NOTE: Developmentor were back in the day THE training company to go to if you wanted highly, highly technical training about COM, .NET, SQL Server, etc. This article by my old colleague Ted Neward (@tedneward) sums DM up quite accurately (apart from the fact that DM hadn't closed its doors when the article was written, ooops).

Overview

The first post in the series discussed how to install and enable SQL Server R Services. In there I mentioned how the SQL Server R Services are different from SQLCLR in that the R engine is external to SQL Server, whereas in SQLCLR, CLR is loaded into the same process as the SQL Server engine.

So in SQL Server R Services, there must be a way to communicate out of the SQL engine, and into the R engine/runtime, and back into the SQL Server process. Before we'll start trying to figure out what is going on, let's make sure WinDbg is set up.

WinDbg

If you want to follow along with what I did, and you haven't used WinDbg before, this section talks about how to attach to a process etc. If you are used to this, please skip it.

In order to use WinDbg, we need to ensure we have the symbol file path set. The easiest is to have the symbol path pointing to Microsoft's Symbol Server. To set things up and also get an introduction to debugging SQL Server with WinDbg, go and read Klaus Aschenbrenner's (@Aschenbrenner) excellent introduction to the subject.

When you want to debug and having the symbol path set you can attach to the SQL Server process, either by doing it as in Klaus' post, or you can attach from within WinDbg:

Figure 1:Attach to Process Menu

In WinDbg you either choose Attach to a Process from the File menu as in Figure 1, or you click F6. You are then presented with the Attach to Process dialog, where you choose sqlservr.exe as in Figure 2:

Figure 2:Attach to Process Dialog

NOTE:NEVER, EVER run WinDbg against a production SQL Server, NEVER!!

When you have attached to the process it can be prudent to reload the symbols for that process, by executing .reload -f sqlservr.exe from the WinDbg command line:

Figure 3:Reload Symbols

You should now be good to go.

sp_execute_external_script

We'll start with the procedure sp_execute_external_script. When reading official documentation, it says that the procedure is an extended stored procedure, and, indeed, if we try and do sp_helptext sp_execute_external_script, the result coming back is like so:

Figure 4:Result of sp_helptext Against sp_execute_external_script

The result indicates that this is internal to the server. Let us see if we can find out what happens when executing the proc by using WinDbg.

An assumption we'll make is that when we execute the proc, there will be some symbols with a name something like ExternalScript among the various SQL Server modules. So, let us see what we can find. We do it by using the x command from the WinDbg command-line, like so: x *!*ExternalScript* (the "*" denotes a wild-card, like "%" in T-SQL). Whoops, that returned quite a lot of information:

Figure 5:Result of Looking for Symbols With ExternalScript in the Name

OK, but we are probably onto something here. When skimming the result we see that ExternalScript occurs in two modules:

sqllang - Implements things to do with T-SQL as well as the query engine.
sqlmin - Implements things related to the relational engine.

Seeing that it occurs in sqllang and sqllang has to do with T-SQL etc., a fairly solid assumption is that we should look in the sqllang module for anything that has to do with that procedure. So just for giggles, let us execute on the WinDbg command-line the following: x sqllang!*execute*externalscript*:

Figure 6:sqllang!SpExecuteExternalScript

Cool, we found something that probably is what we are after: SpExecuteExternalScript.

To see if our assumption is correct we now set a break-point at SpExecuteExternalScript: bp sqllang!SpExecuteExternalScript. After the breakpoint is set, do not forget to click F5 to continue the process. At this stage we can now execute the sp_execute_external_script procedure and see if the breakpoint was hit:

Execution of Procedure

EXECsp_execute_external_script
@language=N'R',
@script=N'OutputDataSet<-InputDataSet',
@input_data_1=N'SELECT 42'
WITHRESULTSETS(([TheAnswer]intnotnull));
GO

Code Snippet 1:Execute sp_execute_external_script

As can be seen in Figure 7 below, the breakpoint is hit. It seems that our assumption is right:

Figure 7:Hitting the Breakpoint

When we have hit the breakpoint as in Figure 7, what do we do then? Well, we can always use the k command to look at the call-stack up until the breakpoint:

Figure 8:Partial call-stack

Figure 8 shows part of the call-stack, and we can see what routines were called leading up to SpExecuteExternalScript. When you have looked at the call-stack you can hit F5, to let the routine complete and you should see a result in the Results tab in SSMS.

But we are not really anywhere closer to understand what is going on, except that we know what has been called up until the breakpoint. We are interested in what routines SpExecuteExternalScript calls. To find out about that, there is a command in WinDbg, which allows us to disassemble routines; the uf command. The command signature looks like so: uf [options] <address>, where the address is the routine, and the options define how to display the result. One of the options is /c which displays only the call instructions in a routine. Let's execute uf /c sqllang!SpExecuteExternalScript and see what is being called. Quite a few calls are made and the figure below shows some of them:

Figure 9:Calls Being Made by SpExecuteExternalScript

When looking at the calls, there is nothing really that stands out, apart from the sqllang!CSQLSource::Execute (00007ff9 ee237ec0) call. We can assume that, that call takes us further down the "rabbit-hole", and we eventually could figure out what is going on. Tracing down could however take quite a while, so let us try another angle.

NOTE: When you have hit a breakpoint in WinDbg you can use a trace command wt to continue the execution and at the same time print out the calls being made. For a call like SpExecuteExternalScript the output will get very, very large, and also not completely easy to interpret, so we will not use it for SpExecuteExternalScript. We will use it a bit later though .

What we will do instead is to go back and look at symbols. The assumption is that calls further down the call-chain will have something to do with external scripts, and most likely be executed from within the sqllang module.

So let us execute a variant of what we did when finding SpExecuteExternalScript; we'll execute x sqllang!*ExternalScript*. Quite a few routines comes back, I have copied some of the ones that might be of interest to us into the code snippet below:

ExternalScript Related Classes and Routines

0:113>xsqllang!ExternalScript
sqllang!CUDXR_ExternalScript::PrepareLauncherInfo(<noparameterinfo>)
sqllang!CUDXR_ExternalScript::ConnectToSatellite(<noparameterinfo>)
sqllang!CUDXR_ExternalScript::Open(<noparameterinfo>)
sqllang!CUDXR_ExternalScript::SetupService(<noparameterinfo>)

Code Snippet 2:Result from x sqllang!*ExternalScript*

If you want, you can set breakpoints for the routines in Code Snippet 2, and see if they are hit when executing the code in Code Snippet 1. In my tests they were hit in the following order:

sqllang!SpExecuteExternalScript
sqllang!CUDXR_ExternalScript::Open
sqllang!CUDXR_ExternalScript::SetupService
sqllang!CUDXR_ExternalScript::PrepareLauncherInfo
sqllang!CUDXR_ExternalScript::ConnectToSatellite:

No surprise we hit SpExecuteExternalScript first, and I guess at one stage the script has to Open. The question is what the other three routines are doing?

In the Overview section in the beginning of this post I wrote how we need to communicate out of SQL Server in order to get to the R runtime. In my previous post, I mentioned the Launchpad service, and how it acts as a "routing" mechanism between SQL Server and external languages/runtimes.

So, somehow SQL Server calls into the launchpad service in order to have the R engine execute the R code. The routines SetupService and PrepareLauncherInfo, has to do with the launchpad service., and we'll shortly have a closer look at what SetupService does. The ConnectToSatellite routine is for when results are coming back into SQL Server from the external runtime.

NOTE: Before we go any further, make sure that you are not sitting at a break-point, e.g. hit F5 to let the debugger run.

What about SetupService then, let us start with disassemble the routine to see what code is being called: uf /c sqllang!CUDXR_ExternalScript::SetupService (all code is not showing, I have copied certain interesting parts):

SetupService

0:125>uf/csqllang!CUDXR_ExternalScript::SetupService
sqllang!CUDXR_ExternalScript::SetupService+0xa6(00007ff877179ee6):
calltosqllang!CSQLSatelliteCommunication::Init(00007ff87763bfc0)
...
sqllang!CUDXR_ExternalScript::SetupService+0x19c(00007ff877179fdc):
calltosqllang!CSQLSatelliteConnection::OpenNpConnection(00007ff87763c480)
...
sqllang!CUDXR_ExternalScript::SetupService+0x3ab(00007ff87717a1eb):
calltosqllang!CSQLSatelliteConnection::WriteMessage(00007ff87763b140)
sqllang!CUDXR_ExternalScript::SetupService+0x3b5(00007ff87717a1f5):
calltosqllang!CSQLSatelliteConnection::FreePackets(00007ff87763bc70)

Code Snippet 3:Interesting Calls in SetupService

When looking at the call in Code Snippet 3 we can see the sqllang!CSQLSatelliteCommunication::Init call. That is a call to initialize communication with the launchpad service. Then, somewhat later, there is an OpenNpConnection call. That call opens a named pipe connection to the launchpad service.

The WriteMessage call finally sends the message packet to the launchpad service, and FreePackets releases the message packet. To further see what is going on, let's trace what WriteMessage is doing.

If you are at a breakpoint right now, F5 out of there and then break into the debugger again, and set a breakpoint at sqllang!CSQLSatelliteConnection::WriteMessage. Execute your T-SQL code again and continue until you hit the breakpoint you just set. When you hit the breakpoint enter the trace command wt. This will now run through the whole function and display what is being called as well as statistics about how many times etc., routines were called. In the code snippet below I have chosen some of the more interesting calls:

WriteMessage

0:013>wt
Tracingsqllang!CSQLSatelliteConnection::WriteMessagetoreturnaddress00007ff87717a1f0
230[0]sqllang!CSQLSatelliteConnection::WriteMessage
...
30109[0]sqllang!CSQLSatelliteConnection::WriteMessage
260[1]sqllang!SNIWriteAsync
210[2]sqllang!Np::WriteAsync
320[3]sqllang!Np::SendPacketAsync
300[4]sqllang!SNI::detail::Transport::PrepareForAsyncCall
4030[3]sqllang!Np::SendPacketAsync
10[4]KERNEL32!WriteFile
370[4]KERNELBASE!WriteFile
60[5]ntdll!ZwWriteFile
516[4]...
34230[1]sqllang!SNIWriteAsync
...

Code Snippet 4:Tracing WriteMessage

In Code Snippet 4 we see WriteAsync, SendPacketAsync and WriteFile is being called. At this stage, the packet has now been sent to the launchpad service. Before you let the process continue, disable the WriteMessage breakpoint, as it will be called when the result returns from R.

To make really sure that what we think happens actually happens we can do a last test, involving Launchpad.exe. In the next post in this series we will look at what happens in the launchpad service in more detail, but for now let us just do a simple test:

Open a second instance of WinDbg and attach to the Launchpad.exe process.
Reload the launchpad symbols: .reload -f launchpad.exe.
Set a breakpoint like so: bp launchpad!CLaunchContext::Launch
In the WinDbg instance for the SQL Server process set breakpoints at:
- sqllang!CUDXR_ExternalScript::SetupService
- sqllang!CSQLSatelliteConnection::WriteMessage
- sqllang!CUDXR_ExternalScript::ConnectToSatellite
Ensure that all other breakpoints for the SQL Server process are disabled
Make sure that both processes are running (i.e. not sitting in break mode).

Execute the T-SQL code and notice what happens (you need to press F5 after hitting each breakpoint):

You hit the breakpoint in the SQL Server process at SetupService.
You hit the breakpoint in the SQL Server process at WriteMessage.
You now hit the Launch breakpoint in the launchpad process.
You are back in the SQL Server process at the ConnectToSatellite breakpoint.

After you have pressed F5 at the ConnectToSatellite breakpoint you will hit the WriteMessage breakpoint quite a few times when the result comes back from R.

Summary

UPDATE & EDIT: To make the summary more "readable" I have added Figure 10 and rearranged (and added) some text.

Through our "spelunking", we have seen in somewhat detail what happens in the SQL Server engine when we execute sp_execute_external_script. Figure 10 below shows from a very high level what goes on in the SQL Server engine:

Figure 10:Call Flow Executing sp_execute_external_script

Following the flow in Figure 10, when executing sp_execute_external_script:

EXEC sp_execute_external_script.
Comes into the SQL Server process, and workers, schedulers, tasks, etc., comes into play.
Eventually sqllang!CSQLSource::Execute is called (first invocation - not the one shown in Figure 9).
Our friend sqllang!SpExecuteExternalScript is called.
The external script is opened in sqllang!CUDXR_ExternalScript::Open.
Things are hotting up and sqllang!CUDXR_ExternalScript::SetupService is hit.
A named pipe connection to the launchpad service is opened in sqllang!CSQLSatelliteConnection::OpenNpConnection.
A message containg the R scrips is written to the launchpad service in sqllang!CSQLSatelliteConnection::WriteMessage.
That message eventually ends up in the launchpad process in launchpad!CLaunchContext::Launch.

In between the calls mentioned, a lot of other calls are also made, but from a high level - this is what happens.

If you have followed along, you can now go off and do your own "spelunking". In next post in this series we will look at what happens in the launchpad service in more detail.

If you have comments, questions etc., please comment on this post or ping me.

↧

Interesting Stuff - Week 11

March 18, 2017, 10:59 pm

≫ Next: Interesting Stuff - Week 12

≪ Previous: Microsoft SQL Server R Services - Internals I

Throughout the week, I read a lot of blog-posts, articles, etc., that has to do with things that interest me

data science
data in general
distributed computing
SQL Server
transactions (both db as well as non db)
and other "stuff"

This is the "roundup" of the posts that has been most interesting to me, this week.

Distributed Computing

From Microliths to Microsystems: Jonas Bonér at QCon London. Jonas Boner discusses micro-services and points out that quite a few micro-services are really microliths (mini monoliths). A micro-service needs to be designed as a distributed system.
Concurrent and Distributed Programming in the Future. Joe Duffy, who previously was Director of Engineering for Languages and Compilers at Microsoft, gives a keynote at QCon London where he talks about concurrent programming. I am so looking forward to the video and slides of this keynote.
Joe Duffy's Blog. So I didn't come across this blog this week, but having mentioned Joe Duffy above, I have to point out his blog which is a goldmine if you are interested in concurrent programming and distributed systems. His series of posts about Midori is a must read.
Conference Recap: Google Cloud Next. As the title says; a recap of Google's Cloud Next conference.

SQL Server

SQLskills SQL101: Temporary table misuse. A post in the SQLSkills SQL101 series, by Paul Randal, about temp-tables and how they can be mis-used.
Comparing performance of data access libraries using StackExchange/Dapper benchmark. Disclaimer: I detest ORM's with a vengeance. So, some benchmarking by Jovan Popovic, from Microsoft, comparing different data access libraries. I'll let you draw your own conclusions, but I am a happy "bunny".
The Vietnam of Computer Science. A post I read a long, long time ago - but as it has some touch-points with the post above I thought I'd share it. Ted Neward, an ex colleague of mine, wrote this post back in 2006 where he compared ORM's with the Vietnam war.
SQLskills SQL101: Indexing Basics. Another post from the SQLSkills SQL101 series. This one is by Kimberly Tripp, and it talks about indexes.
#TSQL2SDAY: The string length server. A tongue in cheek entry for T-SQL Tuesday by Ewald. As a side note, as wrote in lasts week roundup, read his blog if you want to get into the "nitty gritty" of SQL Server internals.
Microsoft SQL Server R Services - Internals I. This post is part 2 of my series about Microsoft SQL Server R Services. This post is about the internals, more specifically about what happens in the SQL engine when you execute sp_execute_external_script.

Streaming

Big Data Processing with Apache Spark – Part 1: Introduction. This article is the first in a series named "Big Data Processing with Apache Spark". A lot of very useful articles!

Data Science

Neural Networks: How they work, and how to train them in R. Neural networks are so "in" at the moment in the data science world. This post by Revolution Analytics summarizes various R packages useful for neural networks and points to a video about neural networks.
Microsoft Data Amp—where data gets to work. Information and registration for an upcoming Microsoft virtual conference about Microsoft's data platform.
Monte Carlo Planning Improves Decision Making. An article which discusses some innovative uses for Monte Carlo simulation.
doAzureParallel: Take advantage of Azure’s flexible compute directly from your R session. This post introduces doAzureParallel, an R package which allows R users to scale up their work to take advantage of cloud compute.

That's all for this week. I hope you enjoy what I did put together. If you have ideas for what to cover, please comment on this post or ping me.

↧

Interesting Stuff - Week 12

March 26, 2017, 10:59 am

≫ Next: Microsoft SQL Server R Services - Internals II

≪ Previous: Interesting Stuff - Week 11

Throughout the week, I read a lot of blog-posts, articles, etc., that has to do with things that interest me

data science
data in general
distributed computing
SQL Server
transactions (both db as well as non db)
and other "stuff"

This is the "roundup" of the posts that has been most interesting to me, this week.

Transaction Systems

Omid reloaded: scalable and highly-available transaction processing. the morning paper looks at Apache Omid, which is a transactional framework that allowing ACID transactions on top of MVCC key/value NoSQL data-stores.

SQL Server

Context in perspective 5: Living next door TLS. Ewald continues his series about context in SQL Server.
SQL Server on Linux: Will it Perform or Not?. In the roundup for week 10, I pointed to Slava Oks slide deck from QCon in London about SQL Server on Linux. This is the video from the presentation.

Streaming

Implementing The Schema Registry. Interesting article about how Sky Betting & Gaming use the Confluent Schema Registry to ensure that various teams always encode and decode messages using the same schema.
Queryable State in Apache Flink® 1.2.0: An Overview & Demo. A very interesting post about Apache Flink now allows you to query application state from external applications.

Data Science

Convolutional neural networks, Part 1. the morning paper dissects some white papers about Convolutional Neural Networks (CNN).
Is it possible to use RevoScaleR package in Power BI?. Tomaz shows how RevoScaleR can be used from inside Power BI, pretty cool!
Alteryx integrates with Microsoft R. Revolution Analytics posts about how Alteryx now supports Microsoft R Server as well as SQL Server R Services. Alteryx is a workflow tool combining data preparation, data blending, and analytics – predictive, statistical and spatial. It looks very interesting!
Running your R code on Azure with mrsdeploy. Another blog-post from Revolution Analytics, this explains how to provision and run an Azure virtual machine (VM), using the mrsdeploy library that comes installed with Microsoft’s R Server.
Retail Customer Churn Prediction: How-To Guide Now Available. Predicting customer churn is almost the "holy grail" in machine learning. Microsoft has done a lot of research about churn prediction, and have now released their Retail Customer Churn Prediction Solution How-to Guide.
End-to-End Data Science Walkthrough with Spark 2.0 on Azure HDInsight Hadoop Clusters. Microsoft has published a tutorial how to use pySpark and MLlib for data science on Spark 2.0 clusters.
Announcing R Tools 1.0 for Visual Studio 2015. More about R Tools for Visual Studio.

That's all for this week. I hope you enjoy what I did put together. If you have ideas for what to cover, please comment on this post or ping me.

↧

Microsoft SQL Server R Services - Internals II

April 2, 2017, 12:38 am

≫ Next: Interesting Stuff - Week 13

≪ Previous: Interesting Stuff - Week 12

This post is part of a series of blog-posts about Microsoft SQL Server R Services:

Microsoft SQL Server 2016 R Services Installation
Microsoft SQL Server R Services - Internals I
Microsoft SQL Server R Services - Internals II (this post)
More to come (hopefully)

This post is the third post about Microsoft SQL Server R Services, and the second post that drills down into the internal of how it works. In the previous internals post we looked at what happens inside the SQL Server engine when we execute sp_execute_external_script. In that post we wrapped up when we reached the launchpad service (Launchpad.exe). This post will look closer at the launchpad service, and we will do it by some more "spelunking"

Recap

In both previous posts about SQL Server R Services I have mentioned that the launchpad service is responsible for launching an external runtime when we execute sp_execute_external_script. In the Internals I post the following picture showed what happened inside the SQL server engine when executing the procedure:

Figure 1:Call Flow Executing sp_execute_external_script

From Figure 1 we see how an named pipe connection is opened from the SQL Server engine into the launchpad service, and eventually the routine sqllang!CSQLSatelliteConnection::WriteMessage writes a message to the service. The message will at one stage or another cause the launchpad!CLaunchContext::Launch routine in the launchpad service to be called. In a little while we'll see how we came to that conclusion.

Launchpad

Launchpad is a new service installed together with SQL Server 2016, and it is there to support execution of scripts targeting external runtimes/engines. The launchpad service calls into launchers, and it is the launcher's job to launch the correct runtime/engine. How does the launchpad service know what launcher dll to call into? To answer that, cast your mind back to the previous post about internals, in that post we looked at the procedure used to execute external scripts: sp_execute_external_script, and we executed some code like so:

Execution of Procedure

EXECsp_execute_external_script
@language=N'R',
@script=N'OutputDataSet<-InputDataSet',
@input_data_1=N'SELECT 42'
WITHRESULTSETS(([TheAnswer]intnotnull));
GO

Code Snippet 1:Execute sp_execute_external_script

Looking at the code in Code Snippet 1 we see the first parameter being the @language parameter, and it is this parameter that tells the launchpad service to use (in this case) the launcher for R.

Launchers

The question still remains though; how does the launchpad service know what specific launcher dll to use? To answer that let us look a little bit more closely at the launchpad service, in the properties dialog:

Figure 2:SQL Server Launchpad Service

If we look at the Path to executable setting under the General as in Figure 2 tab we may get some more insight:

launchpad.exe

"C:\<path_to_SQL_instance\>\MSSQL\Binn\launchpad.exe"
 -launcher RLauncher.dll
 -pipename sqlsatellitelaunch
 -timeout 600000
 -logPath "C:\<path_to_SQL_instance>\MSSQL\LOG\ExtensibilityLog"
 -workingDir "C:\<path_to_SQL_instance>\MSSQL\ExtensibilityData"
 -satelliteDllPath "C:\<path_to_SQL_instance>\MSSQL\Binn\sqlsatellite.dll"

Code Snippet 2:Path to Executable for Launchpad.exe

After having copied the value in Path to executable we see is shown in Code Snippet 2. And in there we can see a command line argument -launcher with a value of RLauncher.dll. If we search for a file named RLauncher.dll we find it in the Binn directory together with all other SQL Server files:

Figure 3:RLauncher

So, a theory is that during startup, the launchpad service reads in the value of the -launcher argument, and perhaps even loads the dll. Is there any way we can prove that theory? We can try:

Go to Services and stop the launchpad service
Delete all files from the directory the -logPath parameter points to.
Start the launchpad service.

You should now see a couple of new files in the log directory, and when you open them you can see how there are log messages about RLauncher.dll. If you have Process Explorer installed you can also verify that RLauncher.dll is loaded by finding the launchpad service process and then look at dll's as in Figure 4 below:

Figure 4:RLauncher Loaded

Before we start "spelunking" with WinDbg, let's look at the arguments used by the launchpad service (as seen in Code Snippet 2) and see what they mean:

-launcher: Full path to the launcher.
-logPath: The launchpad's base log path.
-satelliteDllPath: The sql satellite dll path for the satellites, we'll talk more about them in subsequent posts.
-workingDir: The launchpad and satellite process base working directory.
-cleanupLog: Whether to cleanup the log directory after every execution [0|1] (not set above).
-cleanupWorkingDir: Whether to cleanup the working directory after every execution [0|1] (not set above).
-pipeName: Define the launchpad's name pipe's name - this is from connection from SQL Server.
-timeout: Define the default timeout in ms.
-SqlInstanceName: Define the SqlInstanceName as in MSSQLSERVER or blank for default or an instance name (not set above).

Some of the arguments (most in fact) I found in what I copied from the properties dialog for the service and the field Path to executable mentioned above. However a couple of the arguments were found by, from command prompt, trying to run the launchpad executable like so: C:\<path_to_sql_server_instance>\MSSQL\Binn>launchpad.exe without any arguments. That resulted in an error and some help how to run the launchpad service.

What was interesting in that, was that it - apart from listing the arguments above - also gave an example:

Launchpad Example

Example:
launchpad.exe
    -launcher RLauncher.dll
    -launcher PythonLauncher.dll
    -logPath C:\Temp
    -pipeName mypipename
    -timeout 60000
    -SqlInstanceName MSSQLSERVER

Code Snippet 3:Example of How to Launch Launchpad Defining Multiple Launchers

Notice how in Code Snippet 3 above, multiple launchers are defined, and a launcher for Python being one of them. Maybe we'll soon see Python being supported as well!

WinDbg Investigations

By now we know (or at least we have a strong hunch) that during startup of the launchpad service, the launchers are loaded. Now it is time to start drilling down in what happens inside the launchpad service when we execute code as in Code Snippet 1. First, let us look at what happens during connection from the SQL Server engine to the launchpad service. In the Internals - I post I mentioned how SQL Server opens a named pipe connection to the launchpad service in the sqllang!CSQLSatelliteConnection::OpenNPConnection routine.

When named pipes are used, the client connects to a certain named pipe through the ConnectNamedPipe routine. Let's do some "spelunking" in the launchpad service using WinDbg, and see if we can find anything that can have anything to do with named pipes:

Stop the launchpad service if it is running.
Restart the service.
Open an instance of WinDbg and attach to the Launchpad.exe process.
Reload the symbols: .reload /f.

NOTE: The internals - I post has more information how to attach to a process, and what commands to use.

Now, do a search for routines named like ConnectNamedPipe: x *!*ConnectNamedPipe*:

Figure 5:Routines Named ConnectNamedPipe

Figure 5 shows the result after the search, and the KERNELBASE!ConnectNamedPipe routine looks promising. To find out what happens, we'll:

Open a second instance of WinDbg.
Attach it to the sqlservr.exe process.
Reload the symbols by: .reload /f.
Set a breakpoint at the OpenNPConnection routine: bp sqllang!CSQLSatelliteConnection::OpenNPConnection.

In the debugger attached to the launchpad service we set a breakpoint at ConnectNamedPipe: bp KERNELBASE!ConnectNamedPipe. We then execute the code in Code Snippet 1 and see what happens:

The sqllang!CSQLSatelliteConnection::OpenNpConnection breakpoint is hit.
After continuing, the KERNELBASE!ConnectNamedPipe breakpoint in the launchpad service is hit.

The call-stack at this point looks something like so (call stack output by k c):

Callstack

0:007>kc
# Call Site
00KERNELBASE!ConnectNamedPipe
01launchpad!Np::AsyncAccept+0x143
02launchpad!Np::PrepareForNextAccept+0x9b
03launchpad!SNIAcceptDoneWrapper+0x83
04sqldk!SOS_Task::Param::Execute+0x231
05sqldk!SOS_Scheduler::RunTask+0xaa
06sqldk!SOS_Scheduler::ProcessTasks+0x3cd
07sqldk!SchedulerManager::WorkerEntryPoint+0x2a1
08sqldk!SystemThread::RunWorker+0x8f
09sqldk!SystemThreadDispatcher::ProcessWorker+0x2de
0asqldk!SchedulerManager::ThreadEntryPoint+0x1d8
0bKERNEL32!BaseThreadInitThunk+0x14
0cntdll!RtlUserThreadStart+0x21

Code Snippet 4:Callstack at KERNELBASE!ConnectNamedPipe

So that is how SQL Server connects into the launchpad service:

SQL Server calls sqllang!CSQLSatelliteConnection::OpenNpConnection.
The launchpad service is doing launchpad!SNIAcceptDoneWrapper.
Followed by KERNELBASE!ConnectNamedPipe, and the named pipe is now open.

NOTE: The above is very simplified, in fact a lot of things happen in parallel, and we'll touch upon that a bit later.

Let us now go on and have a look what happens after the connection has been made. How do we go about finding out what happens? Well, we can do it the "brute force" way:

We know that the launcher for R will try and load the R runtime, and if we look in the MSSQL\Binn folder where the RLauncher.dll resides we find a config file for the launcher: rlauncher.config. Let's see what it contains:

Figure 5:RLauncher Configuration

In Figure 5 we see that the configuration file for the launcher contains the RHOME path. With that in mind we can assume that the launcher will call into there and launch the runtime. What happens if the launcher cannot find the path? An exception would probably be thrown, and if we were debugging we could hopefully catch it and have a look at the stack. To test this theory:

Stop the launchpad service if it is running.
Remove the R_SERVICES directory and its content and place it somewhere else.
Delete all files from the directory the -logPath argument in Code Snippet 3 points to.
Restart the launchpad service.

NOTE: Please, please, please DO NOT do this on a production server!!

The restart should just go fine, and you can now attach WinDbg to the launchpad process:

Open an instance of WinDbg and attach to the Launchpad.exe process.
Reload the symbols: .reload /f.
Hit F5 to let the debugger run.

When the debugger runs, you execute the code in Code Snippet 1, and you get an ugly error in management studio:

Figure 6:SQL Exception

Yeah, kind of obvious that the runtime for R cannot be launched as it is nowhere to be found. WinDbg also reports some exceptions but the debugger is still running. When looking at the exceptions you see something like so:

WinDbg Exception

(2fcc.34c8): C++ EH exception - code e06d7363 (first chance)
(2fcc.35c): C++ EH exception - code e06d7363 (first chance)
[2017-04-01 07:14:54.605][Error]
ProcessPool::CreateProcess(MSSQLSERVER01-ss-2,
  78A18DD6-14A7-4BFA-BC1C-664A653A070C) failed with:
  Failed with (80070003) to start executable
  C:\<path_to_sql_instance>\R_SERVICES\bin\x64\rterm.exe
  with args  --slave --no-restore --no-save -e "library(RevoScaleR);

Code Snippet 5:C++ EH Exception

In Code Snippet 5 we see that the launcher tries and create a process for RTerm.exe which is the entry-point to the R runtime. So the question is what was being called to get to this point. If you break out of the debugger and under the Debug menu choose Event Filters:

Figure 7:Event Filter

In there you can choose how certain events are handled. In Code Snippet 5 we see that the exception we encounter is a C++ EH exception, so in the Event Filters dialog you can set how that particular exception should be handled. We want it to be enabled, but not handled:

Figure 8:Enable C++ EH Exception

When you execute the code again after having enabled the exception as in Figure 8, the debugger will now break at the exception, and you can view the call-stack through the k command. In Code Snippet 6 below I show an abbreviated part of the call-stack:

Call-stack

0:010> k
...
06  RLauncher!GetInstance+0x3239f
07  RLauncher!GetInstance+0x32a5d
08  RLauncher!GetInstance+0x5aeee
09  RLauncher!GetInstance+0x3fbca
0a  RLauncher!GetInstance+0x1c864
0b  RLauncher!SQLSatellite_GetLauncherAPI+0x9dd
0c  launchpad!CLaunchContext::Launch+0x160
0d  launchpad!CLaunchContext::LaunchInternal+0x2df
0e  launchpad!CLaunchContext::LaunchServTask+0x357
0f  sqldk!SOS_Task::Param::Execute+0x231

Code Snippet 6:Call Stack at Exception

In Code Snippet 6 we see the launchpad!CLaunchContext::Launch routine, the same we identified in the internals I post. That routine has an important part to play when calling into the launcher. When you look at Code Snippet 5 you also see routines from the RLauncher module. Unfortunately we cannot really see what goes on inside the launcher as there is no symbol file for it.

Coming back to launchpad!CLaunchContext::Launch; I said it has an important part to play, and it does. However the launchpad!CLaunchContext::LaunchServTask routine which you also can see in Code Snippet 6 is even more important. That routine sets up most of the things that happen when SQL Server calls into the launchpad service. When I talked about the named pipe connection and how things happened in parallel, if you were to set a breakpoint at launchpad!CLaunchContext::Launch and then output the call-stack and compared that call-stack to what is shown in Code Snippet 4, you would see that both call-stacks have the same originating methods and addresses.

Seeing that we don't have symbol files for the RLauncher module we have to "hunt" around in WinDbg as well as making some assumptions of what is happening if we want to go further in the "spelunking". When we, in Code Snippet 5 looked at the exception we received in the debugger we saw something about CreateProcess, and we already saw the Launch routine in Code Snippet 6. What if we were to look for something like that in the launchpad module: x launchpad!*Launch*Process*. That reveals:

launchpad!SQLSatellite_LaunchProcess
launchpad!PhysicalUserContext::LaunchProcess

Especially the second; launchpad!PhysicalUserContext::LaunchProcess, is interesting - as when launching the R runtime it should be done in the context of a user. So let us set a couple of breakpoints and see what happens. Set one breakpoint at: bp launchpad!CLaunchContext::Launch and the other at: launchpad!PhysicalUserContext::LaunchProcess. Then execute the code again.

When executing we see how we first break at bp launchpad!CLaunchContext::Launch, followed by launchpad!PhysicalUserContext::LaunchProcess, and we still have not had any exceptions. If you now hit F5, you will hit the exceptions immediately. So it seems that the launchpad!PhysicalUserContext::LaunchProcess, is where it happens; where we try and load the R runtime.

To confirm this we can copy back the R_SERVICES directory to where it is supposed to be, and while still having a breakpoint at launchpad!PhysicalUserContext::LaunchProcess execute the code again. When you hit the breakpoint, go to process explorer and have a look at running processes where the process is named something with R. On my machine it looks something like so:

Figure 9:Before LaunchProcess

After having continued the debugger, it looks like so:

Figure 10:After LaunchProcess

In Figure 10 we now see how some instances of RTerm.exe has been spun up. In next post we'll look at why there are multiple instances.

Summary

We should now have a somewhat better understanding of what happens when the launchpad service is called from SQL Server. In Figure 11 below shows some of the significant events/calls when sp_execute_external_script is executed:

Figure 11:Launchpad Service Call Flow

The flow is something like this:

A call comes in from SQL Server.
Enters the launchpad process, and workers, schedulers, tasks, etc., comes into play.
Eventually sqldk!SOS_Scheduler::RunTask is called.
Named pipe connection is accepted and opened.
More or less in parallel launchpad!CLaunchContext::LaunchServTask is called.
We get into launchpad!CLaunchContext::Launch.
The launchpad!CLaunchContext::Launch routine calls into the launcher.
Finally launchpad!PhysicalUserContext::LaunchProcess is called and the RTerm process is started.

Some of the above are somewhat educated guesswork since I don't have the symbol file for RLauncher.dll. I do however believe it is more or less accurate. In the next internals post we'll look at what happens in the RTerm process.

If you have comments, questions etc., please comment on this post or ping me.

↧

Interesting Stuff - Week 13

April 2, 2017, 7:39 am

≫ Next: Interesting Stuff - Week 14

≪ Previous: Microsoft SQL Server R Services - Internals II

Throughout the week, I read a lot of blog-posts, articles, etc., that has to do with things that interest me

data science
data in general
distributed computing
SQL Server
transactions (both db as well as non db)
and other "stuff"

This is the "roundup" of the posts that has been most interesting to me, this week.

Streaming

Stream Processing & Analytics with Flink @Uber. A presentation about how Uber builds its next generation of stream processing system to support real-time analytics as well as complex event processing.
Scaling up Near Real-time Analytics @Uber &LinkedIn. Another presentation involving Uber. This time in combination with LinkedIn, and how they use Apache Samza, Apache Calcite and Pinot.
Real-time Recommendations using Spark Streaming. Presentation about how Netflix uses Kafka, Spark Streaming, and Cassandra for real-time recommendations.
Spotify’s Event Delivery – The Road to the Cloud (Part I). First post in a series about what Spotify is handling event delivery. Really interesting!

SQL Server

Context in perspective 6: Taking Sessions to Task. The sixth post by Ewald about context in SQL Server. As usual it is a treasure trove of SQL Server internals information!
SQLskills SQL101: Partitioning. Kimberly continues SQLskills SQL101 series with a post about partitioning.
SQL Server On Linux: Core-minidumps and Breakpad. An awesome post by Bob Dorr about debugging SQL Server on Linux and core dumps.
Microsoft SQL Server R Services - Internals II. Earlier today I finished the third post in my series about Microsoft SQL Server R Services. This post being the second about the internals. The post drills down into the launchpad service.

Data Science

UK government using R to modernize reporting of official statistics. A post from Revolution Analytics how the UK government uses R.
Learning Scrabble strategy from robots, using R. Another post from Revolution Analytics, this time about how R has been used to analyze Scrabble games.

That's all for this week. I hope you enjoy what I did put together. If you have ideas for what to cover, please comment on this post or ping me.

↧

Interesting Stuff - Week 14

April 9, 2017, 5:51 am

≫ Next: Microsoft SQL Server R Services - Internals III

≪ Previous: Interesting Stuff - Week 13

Throughout the week, I read a lot of blog-posts, articles, etc., that has to do with things that interest me

data science
data in general
distributed computing
SQL Server
transactions (both db as well as non db)
and other "stuff"

This is the "roundup" of the posts that has been most interesting to me, this week.

SQL Server

SQL Server doubly linked lists revisited. Ewald chases bugs in linked lists!

Distributed Computing

Reactive & Asynchronous - Adventures with APIs in Financial Trading. Michael Barker from LMAX and the father of Disruptor gives a presentation about low-latency APIs used for financial trading and how they compare to APIs based on HTTP, REST,JSON etc.

Streaming

From Streams to Tables and Back Again: An Update on Apache Flink's Table & SQL API. In previous weekly roundups (here and here) I have mentioned Flink and it's new APIs. This is a blog-post discussing the Table API more.

Data Science

Microsoft Updates its Deep Learning Toolkit. A post announcing the update of its "Microsoft Cognetive Toolkit". It looks very, very interesting, and when I get some spare time (hah) I'll take it for a spin.
Publish R functions as stored procedures with the sqlrutils package. Revolution Analytics blogs about the sqlrutils package and functions in the package helping you make stord procedures out of your R functions. Very cool!
Latest Rev of Utilities for Microsoft Team Data Science Process (TDSP) Now Available. Another update from Microsoft. This time it is the Team Data Science Process Utilities.
Deep Learning at Scale. Presentation about the Deep Scalable Sparse Tensor Network (DSSTNE). This kind of blew my mind.
Building Robust Machine Learning Systems. Another presentation giving food for thought; how to make sure our machine learning systems are robust, well audited and can be inspected.
Microsoft R Open 3.3.3 now available. Revolution Analytics reports of the release of Microsoft R Open 3.3.3.

In other news; I am still working on Microsoft SQL Server R Services - Internals III, I hope the post will be out early this coming week. In the meantime you can always re-read I and II.

That's all for this week. I hope you enjoy what I did put together. If you have ideas for what to cover, please comment on this post or ping me.

↧

Microsoft SQL Server R Services - Internals III

April 12, 2017, 9:13 pm

≫ Next: Interesting Stuff - Week 15

≪ Previous: Interesting Stuff - Week 14

This post is part of a series of blog-posts about Microsoft SQL Server R Services:

Microsoft SQL Server 2016 R Services Installation
Microsoft SQL Server R Services - Internals I
Microsoft SQL Server R Services - Internals II
Microsoft SQL Server R Services - Internals III (this post)
More to come (hopefully)

This post is the fourth post about Microsoft SQL Server R Services, and the third post that drills down into the internal of how it works. In the previous internals posts here and here we have looked at what goes on inside SQL Server when executing an R script, and what happens in the launchpad service when executing.

This post was initially to be about the R runtime, and what other parts are involved when executing R scripts in SQL Server. However, during my "spelunking" around I realized that I did not really understand what happens when we create the processes for the R runtime from the launchpad service and the launcher (which is what I covered in the internals II post).

So, instead of looking at the R runtime and the other R related components, we'll look at what happens when creating external processes. A subsequent blog-post will cover the R runtime and components.

To begin with, the code we use in this post to execute R scripts are the same as we have used in the other posts:

Execution of Procedure

EXECsp_execute_external_script
@language=N'R',
@script=N'OutputDataSet<-InputDataSet',
@input_data_1=N'SELECT 42'
WITHRESULTSETS(([TheAnswer]intnotnull));
GO

Code Snippet 1:Execute sp_execute_external_script

The code in Code Snippet 1 doesn't do much, but it is quite sufficient for our purposes. Having said that, let us get on with it.

Recap

Let us refresh our memories about what the two previous internals posts covered, by looking at some images from those two posts. In internals I we said that the following happened when executing what Code Snippet 1 shows:

Figure 1:Call Flow Executing sp_execute_external_script

In Figure 1 we see how an named pipe connection is opened from the SQL Server engine into the launchpad service, and eventually the routine sqllang!CSQLSatelliteConnection::WriteMessage writes a message to the service. The message will at one stage or another cause the launchpad!CLaunchContext::Launch routine in the launchpad service to be called.

Figure 2 below, from the internals II shows the flow the call in Figure 1 enters the launchpad service:

Figure 2:Launchpad Service Call Flow

Various routines are called, and eventually launchpad!PhysicalUserContext::LaunchProcess is called followed by KERNELBASE!CreateProcessInternalW which creates the RTerm.exe process, which you can see in Figure 3. The script to be executed is sent to the R process in the launchpad!CSQLSatelliteConnection::WriteMessage call.

In the beginning of this post I mentioned that I had changed what to cover in this post, and what you see in Figure 3 is partly why I did that:

Figure 3:RTerm Process

We see a number of RTerm processes, why is that? Let us see if we can answer that question as well as see what happens when we create the processes as well as when the R script code executes. To understand how that works we should however first talk about users and user accounts.

User Accounts

When we execute something in SQL Server we do it in context of a user, either a SQL Server user or Windows user, and SQL Server has the concept of ISOLATION LEVELS to ensure that the same code can be executed concurrently. When we execute sp_execute_external_script however, we exit SQL Server and the actual execution takes place outside of the SQL Server engine; inside the external runtime. The external engine does not have the notion of ISOLATION LEVELS, so if multiple users executed the same script concurrently, bad things could potentially happen. How can we ensure isolation between users executing concurrently?

To ensure isolation, SQL Server R Services creates during installation a pool of Windows accounts in a Windows account group. The group (and accounts) are created per instance of SQL Server R Service. The group for SQL Server R Services installed on the default SQL Server instance is SQLRUserGroup and for non-default instances it is SQLRUserGroupInstanceName. On my machine I have three installations of SQL Server, with SQL Server R Services installed on all of them, and Figure 4 shows the groups that have been created:

Figure 4:SQL Server R Services User Groups

So what does a user group look like then? If we double click on the SQLRUserGroup, we'll see something like so:

Figure 5:Members in User Group

The installation of SQL Server R Services creates by default 20 user accounts, named MSSQLSERVER01...20 for the default instance, and InstanceName01...20 for named instances. When executing external scripts the executing user will be mapped to one of these accounts by the launchpad service.

NOTE: The number of accounts created can be altered. See this MSDN post for information about that.

While executing external scripts, there may be a need to store script objects, intermediate results, etc. For this reason the installation of SQL Server R Services not only creates the user accounts as per above, but it also creates folders for each account where to store the objects, results etc. The folders are created at: C:\<path_to_sql_server_instance>\MSSQL\ExtensibilityData, and are named as the user account. In Figure 6 below you see an example of this:

Figure 6:Folders for User Accounts

It is not entirely correct to say that the files, results, etc., are stored directly in the user folder. They are in fact stored in sub-folders of the user folder. Let us do some coding to see an example of the mapping to a user account as well as sub-folders:

Stop the launchpad service if it is running.
Delete any sub folders of the user account folders in the C:\<path_to_sql_server_instance>\MSSQL\ExtensibilityData directory. Do NOT delete the user account folders themselves.
Restart the launchpad service.
Execute the code in Code Snippet 1.

When you have executed go to the C:\<path_to_sql_server_instance>\MSSQL\ExtensibilityData directory and check in the various user account folders. If you are the only user on the server, you should now see in the xxx01 folder some sub-folders as in Figure 7 below:

Figure 7:Sub-folders of the User Account Folder

If two SQL Server users had executed scripts concurrently two of the user folders would have had sub-folders. If you open one of the sub-folder you should see some files and a folder. This comes from what we mentioned above, about storage of files etc.

The question is why are there more than one sub-folder, since we only executed the code once? The same question can be asked when you look at Figure 3, the RTerm processes; why do we have more than one RTerm process for one execution? The other question to ask is if the multiple RTerm processes are related to the multiple sub-folders? We can probably guess that is the case, but can we prove it?

NOTE: Ina future blog-post I'll look into what the files are that are stored in the sub-folders.

RTerm Processes

The assumption is that there is something in common between the user account's sub-folders and the multiple RTerm processes we saw in Figure 3. Let us execute some code and see if we can figure out if our assumption is right. For this we should change the code in Code Snippet 1 slightly, to inject an artificial pause in the execution. That way it should be easier for us to see what is going on. The code to execute now looks like so:

Script with Pause

EXECsp_execute_external_script
@language=N'R',
@script=N'OutputDataSet<-InputDataSet;
                              Sys.Sleep(120);',
@input_data_1=N'SELECT 42'
WITHRESULTSETS(([TheAnswer]intnotnull));

Code Snippet 2:Execute with Sys.Sleep

As you can see in Figure 2, the code looks almost the same as in Code Snippet 1, apart from that we have injected a Sys.Sleep(120), which will make the execution stop for 2 minutes. That should give us ample time to do some "spelunking".

NOTE: For this investigation I use SysinternalsProcess Explorer.

So:

Stop the launchpad service.
Delete any sub folders of the user account folders in the C:\<path_to_sql_server_instance>\MSSQL\ExtensibilityData directory. Do NOT delete the user account folders themselves. Keep the File Explorer open at the ExtensibilityData directory.
Restart the launchpad service.
Start Process Explorer, order by Process, and scroll down to where you see process names starting with "RT" (on my box there are none at this stage), or where the processes should be.
- If you at this stage see RTerm, restart the launchpad service again and kill those processes.
Execute the code in Code Snippet 2.

While the code is running, take a quick look in Process Explorer, and you should see something like so:

Figure 8:RTerm Processes

You will see multiple RTerm processes running, quickly right click on each of them and you'll see some properties like in the figure below:

Figure 9:RTerm Process Properties

When you look in the properties pop-up window, you can see the part highlighted in Figure 9, is actually the path and folder name of one of the sub-folders of the user account:

Figure 10:User Account Sub-folders

Figure 10 is just to prove that the above is correct. So it seems that our assumption was correct and that there is a one-to-one relationship between the RTerm processes and the sub-folders.

It looks like we have answered the question above about correlation between the sub-folders and the RTerm processes, but what about the question why? Well, in my opinion it is for performance reasons. By having a pool of processes for a user, it should be quicker to to execute multiple scripts: a new process does not have to be "spun up" for the execution but can be taken from the process pool.

So how does this work under the covers?

Process Pools

In the internals II post I covered how an external runtime is launched. However we are not launching a runtime as such, we are launching multiple runtime processes (in this case RTerm.exe). So even if the previous post is correct, there are more things going on that what was covered there, and that is related to the multiple processes and the process pool I mentioned above. The way I came across this was due to me pondering why there are multiple RTerm processes launched, and if there was any logic behind the number. I quite often saw 5 processes, and 5 or 6 sub-folders.

My initial thought was that this is stored in some config file, and the obvious file would be for the launcher(s), and we looked at this file in the previous post:

Figure 11:RLauncher Configuration

However, the file doesn't contain that much, and nothing that seems to relate to number of processes. There is the USER_POOL_SIZE setting, but it is set to 0, so I guess it is not that. When looking at the file, I saw the TRACE_LEVEL setting, and I browsed through the web and came across this post, which briefly discusses rlauncher.config, In that post they mention the TRACE_LEVEL setting which is used to configure the trace verbosity level of the launchpad service, and the traces stored in the log file for the launchers:

1 = Error (default)
2 = Performance
3 = Warning
4 = Information

Maybe if we change the TRACE_LEVEL to 4, we'd be able to get more information. I stopped the launchpad service, and opened the config file as an administrator and changed the TRACE_LEVEL to 4, and saved it. I then deleted all log files in the C:\<path_to_slq_server_instance>\MSSQL\Log\ExtensibilityLog, and restarted the launchpad service.

When opening the rlauncher.log file, there is quite a lot if interesting information, and in the code snippet below I have selected out some of the more interesting bits and pieces:

Log

[2017-04-09 05:37:50.253]
 # "stuff" from config
 File=C:\<path_to_sql_server_Instance>\MSSQL\Binn\rlauncher.config
 RHome=C:\<path_to_sql_server_Instance>\R_SERVICES
 MpiHome=C:\Program Files\Microsoft MPI
 InstanceName=MSSQLSERVER
 LogDirectory=
      C:\<path_to_sql_server_Instance>\MSSQL\LOG\ExtensibilityLog
 WorkingDirectory=C:\PROGRA~1\MICROS~2\MSSQL1~1.MSS\MSSQL\EXTENS~1
 WorkingDirectoryLongPath=
      C:\<path_to_sql_server_Instance>\MSSQL\ExtensibilityData
 1.
 SqlSatellitePath=
      C:\<path_to_sql_server_Instance>\MSSQL\Binn\sqlsatellite.dll
 SqlSatelliteRPath=
      C:\<path_to_sql_server_Instance>\MSSQL\Binn\sqlsatellite.dll
 SqlSatelliteDirectory=C:\<path_to_sql_server_Instance>\MSSQL\Binn
 2.
 ProcessPoolingEnabled=1
 ProcessRecycleEnabled=0
 StaleProcessTime=300000 msecs
 StaleProcessPollTime=60000 msecs
 TelemetryFlushInterval=300000
 3.
 ProcessPoolSqlSatelliteGrowth=5
 ProcessPoolRxJobGrowth=3

Code Snippet 3:Excerpt from rlauncher.log

So, some comments about the above:

We have some log entries about SqlSatellite. This is an API to support external code and external run times. We will see more about it in later blog-posts.
Process pooling, settings if it is enabled and when a process is considered stale. In this case a process is considered stale after 5 minutes of inactivity, and we poll every minute against the processes.
Aha, settings about size of the pool.

From the above we can see that the number of processes are not a random number. I still can't find where the actual number is stored, so my assumption is that it is hard coded into respective launcher.

If you remember from the internal II post, I said that I believe that the launcher(s) are loaded when the launchpad service starts up. At the end of the log-file we see some entries that points to that as well:

Log

SQLSatellite_InitLauncher(600000, 1, 1,
    C:\<path_to_sql_server_Instance>\MSSQL\LOG\ExtensibilityLog,
    C:\<path_to_sql_server_Instance>\MSSQL\Binn\sqlsatellite.dll,
    C:\<path_to_sql_server_Instance>\MSSQL\ExtensibilityData)
    completed: 00000000
< SQLSatellite_InitLauncher, dllmain.cpp, 157 (4 msecs)
> SQLSatellite_RegisterLaunchContext, dllmain.cpp, 209
    SQLSatellite_RegisterLaunchContext(000000664E9FF750)
               completed: 00000000
< SQLSatellite_RegisterLaunchContext, dllmain.cpp, 209 (0 msecs)
> SQLSatellite_GetSupportedScriptTypes, dllmain.cpp, 107
    SQLSatellite_GetSupportedScriptTypes(1) completed: 00000000
< SQLSatellite_GetSupportedScriptTypes, dllmain.cpp, 107 (0 msecs)

Code Snippet 4:Initialization of Launcher(s)

The way I read what is in Code Snippet 4 is that at the very end of the launchpad service startup, launchers are initialized, and information about what script types are supported is retrieved.

We have now seen what happens when the launchpad service is started, and have drawn some conclusions from that. What will the log-file tell us when we execute some code:

Stop the launchpad service.
Delete any sub folders of the user account folders in the C:\<path_to_sql_server_instance>\MSSQL\ExtensibilityData directory. Do NOT delete the user account folders themselves.
Delete all log files in the C:\<path_to_slq_server_instance>\MSSQL\Log\ExtensibilityLog
Restart the launchpad service.
Execute the code in Code Snippet 1 (the code without pause).

When looking through the log-file, it is a LOT of information logged. I have tried to excerpt the most important bits below:

Log-file

1.
Session(3291DD7C-4451-48E1-9838-C5A0DF67FA74)
   CleanupOnExit=1, Settings.JobCleanupOnExit=1
Session 3291DD7C-4451-48E1-9838-C5A0DF67FA74
   assigned to MSSQLSERVER01 user
2.
ProcessPool(MSSQLSERVER01-ss-2) with minimum processes 5 created
3.
WorkingDirectory(C:...\MSSQLSERVER01\5F3BEB48-4E0A-4D0C-ACCF-3D8C6EB972EF) created (1)
WorkingDirectory(C:...\MSSQLSERVER01\E7AB7781-A8C7-421F-BAA0-074075B41082) created (1)
4.
CreateProces(C:...\R_SERVICES\bin\x64\rterm.exe
    --slave --no-restore --no-save
    -e "library(RevoScaleR);
    sessionDirectory <- 'C:\...\5F3BEB48-4E0A-4D0C-ACCF-3D8C6EB972EF';
    sessionId <- '5F3BEB48-4E0A-4D0C-ACCF-3D8C6EB972EF';
CreateProces(C:...\R_SERVICES\bin\x64\rterm.exe
    --slave --no-restore --no-save
    -e "library(RevoScaleR);
    sessionDirectory <- 'C:\...\E7AB7781-A8C7-421F-BAA0-074075B41082';
    sessionId <- 'E7AB7781-A8C7-421F-BAA0-074075B41082';
5.
Assigning PooledProcess(..., ..., 5F3BEB48-4E0A-4D0C-ACCF-3D8C6EB972EF,
    C:...\R_SERVICES\bin\x64\rterm.exe)
    to Job PooledProcess-5F3BEB48-4E0A-4D0C-ACCF-3D8C6EB972EF(0000081C)
Assigning PooledProcess(..., ..., E7AB7781-A8C7-421F-BAA0-074075B41082,
    C:...\R_SERVICES\bin\x64\rterm.exe)
    to Job PooledProcess-E7AB7781-A8C7-421F-BAA0-074075B41082(00000804)
6.
Session[3291DD7C-4451-48E1-9838-C5A0DF67FA74]
    attached to pooled processes [5F3BEB48-4E0A-4D0C-ACCF-3D8C6EB972EF]
7.
WorkingDirectory(C:...\A379EE0B-385A-41A9-86C3-A5C9D1FFDE7F) created (1)
CreateProces
    C:...\R_SERVICES\bin\x64\rterm.exe
    ...
    sessionId <- 'A379EE0B-385A-41A9-86C3-A5C9D1FFDE7F';
8. This is interleaved with 7
SQLSatellite_LaunchSatellite(1, 3291DD7C-4451-48E1-9838-C5A0DF67FA74,
    1, 51008, nullptr, 000000CC5ADFE720,
    C:...\ExtensibilityData\MSSQLSERVER01) completed: 00000000  
7a. here we continue with creation of processes started at 7
Assigning PooledProcess(..., ..., A379EE0B-385A-41A9-86C3-A5C9D1FFDE7F,
    C:...\R_SERVICES\bin\x64\rterm.exe)
    to Job PooledProcess-A379EE0B-385A-41A9-86C3-A5C9D1FFDE7F(00000854)
...
9.
ProcessPool(MSSQLSERVER01-ss-2) adding PooledProcess(..., ...,
    E7AB7781-A8C7-421F-BAA0-074075B41082,
    C:...\R_SERVICES\bin\x64\rterm.exe)
ProcessPool(MSSQLSERVER01-ss-2) adding PooledProcess(..., ...,
    A379EE0B-385A-41A9-86C3-A5C9D1FFDE7F,
    C:...\R_SERVICES\bin\x64\rterm.exe)
ProcessPool(MSSQLSERVER01-ss-2) adding PooledProcess(..., ...,
    8CE5541F-3A70-4538-9329-9A74FC0580DE,
    C:...\R_SERVICES\bin\x64\rterm.exe)
ProcessPool(MSSQLSERVER01-ss-2) adding PooledProcess(..., ...,
    97608E5B-B70D-4E69-9D6D-56BEC2FC0F81,
    C:...\R_SERVICES\bin\x64\rterm.exe)
ProcessPool(MSSQLSERVER01-ss-2) adding PooledProcess(..., ...,
    74E3C23C-56B6-4A16-B1E3-E4C00335FB97,
    C:...\R_SERVICES\bin\x64\rterm.exe)
10.
Job PooledProcess-5F3BEB48-4E0A-4D0C-ACCF-3D8C6EB972EF(0000081C)
    WaitAll(1, 5000) completed with 0
Job PooledProcess-5F3BEB48-4E0A-4D0C-ACCF-3D8C6EB972EF(0000081C)
    destroyed
~WorkingDirectory
    (C:...\MSSQLSERVER01\5F3BEB48-4E0A-4D0C-ACCF-3D8C6EB972EF)
    deleted (0)
11.
Session 3291DD7C-4451-48E1-9838-C5A0DF67FA74
   removed from MSSQLSERVER01 user
Session(3291DD7C-4451-48E1-9838-C5A0DF67FA74)
    [SqlSatellite] deleted. Elapsed time: 1222 msecs

Code Snippet 5:Log when Executing Script

Geez, quite a lot of information, and - as I mentioned above - I have tried to just show the log entries that I deem important. The entries themselves are also abbreviated quite a bit. So what do we see:

We create a session for this request, and assigns the session to the executing user (MSSQLSERVER01).
A pool for the processes is created.
Two working directories (sub-folders) are created for the user (MSSQLSERVER01).
Two physical RTerm process are created and assigned the id of the sub-folders.
The processes are added to a job object (which has been created prior to this).
The session is attached to one of the two processes. This process will now execute the request.
Four new processes and working directories are created. Interleaved with this is point 8 (below). The existing session is not assigned to any of these. At this stage there are 6 processes in total, two from the original process creation and these four new ones. Five of these processes do not have any session attached.
From what I can gather, here is where the request is actually executed.
The 5 processes without the session are added to the process pool, and are available for new requests from the same user.
The process which executed the request finishes and is being destroyed.
The session is removed and deleted.

From the above we can see how processes are created and then being pooled. The processes will eventually be torn down if they are not in use, as per the StaleProcessTime in Code Snippet 3.

How does all this now fit into what we covered in the internals II post, and Figure 2 above?

WinDbg

Well, let us try to find out. For this we use our trusty WinDbg (go back to the internals I post if you need a re-fresher about WinDbg):

Stop the launchpad service.
Delete any sub folders of the user account folders in the C:\<path_to_sql_server_instance>\MSSQL\ExtensibilityData directory. Do NOT delete the user account folders themselves.
Delete all log files in the C:\<path_to_slq_server_instance>\MSSQL\Log\ExtensibilityLog
Restart the launchpad service.
Attach WinDbg to the launchpad process.

As we did in the internals I post, let's hunt for routines from the symbols. When looking at the log files above we see some references to SQLSatellite and in previous posts we have also seen satellite references. So, let us do a very coarse search: x /n launchpad!*Satellite*. By using the /n flag you sort everything by name, which can make things a bit more readable. Looking through what was returned, I have below listed some classes that looks interesting as well a couple of independent routines:

Satellite

//classes
launchpad!CSQLSatelliteCommunication
launchpad!CSQLSatelliteConnection
launchpad!CSQLSatelliteMessage
launchpad!CSatelliteCargoContext
launchpad!CSatelliteRuntimeContext
launchpad!SatelliteJobObject
launchpad!SatelliteSession
launchpad!SatelliteSessionManager
launchpad!Satellite_ResourceManager
//independent routines
launchpad!CreateProcessForSatelliteSession
launchpad!SQLSatellite_Init
launchpad!SQLSatellite_InitLaunchContext
launchpad!SQLSatellite_LaunchProcess

Code Snippet 6:Interesting Output

Having identified interesting classes and function as in Code Snippet 6, it is time to fine interesting routines, and through trial and error (setting breakpoints and executing code) finding out what is happening. I eventually arrived at the following (please remember from the previous post that I do not have symbols for the RLauncher.dll):

Call Chain

1.launchpad!CLaunchContext::LaunchServTask
2.launchpad!SatelliteSessionManager::ConstructSatelliteSession
2.launchpad!SatelliteJobObject::CreateSatelliteJobObject
2.launchpad!SatelliteSessionManager::CreateNewSessionObject
3.launchpad!CLaunchContext::Launch
4.Rlauncher!"MiscCalls"
4.launchpad!SQLSatellite_LaunchProcess
4.launchpad!CreateProcessForSatelliteSession
4.launchpad!PhysicalUserContext::LaunchProcess
4.KERNELBASE!CreateProcessInternalW
4.Rlauncher!"MiscCalls"
4.launchpad!SQLSatellite_LaunchProcess
4.launchpad!CreateProcessForSatelliteSession
4.launchpad!PhysicalUserContext::LaunchProcess
4.KERNELBASE!CreateProcessInternalW
4.launchpad!Satellite_ResourceManager::AssociateProcessToJobObject
4.launchpad!Satellite_ResourceManager::AssociateProcessToJobObject
4.launchpad!SatelliteJobObject::AssociateProcess
4.launchpad!SatelliteJobObject::AssociateProcess
4.Rlauncher!"MiscCalls"
4.launchpad!SQLSatellite_LaunchProcess
4.launchpad!CreateProcessForSatelliteSession
4.launchpad!PhysicalUserContext::LaunchProcess
4.KERNELBASE!CreateProcessInternalW
5.launchpad!CSQLSatelliteCommunication::SendResumeWithLoginInfo
5.launchpad!CSQLSatelliteConnection::WriteMessage
4.launchpad!Satellite_ResourceManager::AssociateProcessToJobObject
4.launchpad!SatelliteJobObject::AssociateProcess
6.Rlauncher!"MiscCalls"
6.launchpad!SQLSatellite_LaunchProcess
6.launchpad!CreateProcessForSatelliteSession
6.launchpad!PhysicalUserContext::LaunchProcess
6.KERNELBASE!CreateProcessInternalW
6.launchpad!Satellite_ResourceManager::AssociateProcessToJobObject
6.launchpad!SatelliteJobObject::AssociateProcess
6.Rlauncher!"MiscCalls"
6.launchpad!SQLSatellite_LaunchProcess
6.launchpad!CreateProcessForSatelliteSession
6.launchpad!PhysicalUserContext::LaunchProcess
6.KERNELBASE!CreateProcessInternalW
6.launchpad!Satellite_ResourceManager::AssociateProcessToJobObject
6.launchpad!SatelliteJobObject::AssociateProcess
6.Rlauncher!"MiscCalls"
6.launchpad!SQLSatellite_LaunchProcess
6.launchpad!CreateProcessForSatelliteSession
6.launchpad!PhysicalUserContext::LaunchProcess
6.KERNELBASE!CreateProcessInternalW
6.launchpad!Satellite_ResourceManager::AssociateProcessToJobObject
6.launchpad!SatelliteJobObject::AssociateProcess
7.launchpad!SatelliteSessionManager::DestroySatelliteSession
7.launchpad!SatelliteSessionManager::RemoveSessionObjectFromStore

Code Snippet 7:High Level Call Chain

From Code Snippet 7 we see how we:

Call LaunchServTask
Creating a satellite session, a job object and a session object.
We then come into the code path were we launch the launcher and create the RTerm processes: launchpad!CLaunchContext::Launch.
In there we do calls to the launcher dll (Rlauncher!"MiscCalls"), create the actual processes and assigns that to the job object. We initially create two processes, and one of the processes will be mapped to the session (as we saw from the log-file).
The calls launchpad!CSQLSatelliteCommunication::SendResumeWithLoginInfo and launchpad!CSQLSatelliteConnection::WriteMessage is where the script is sent to the R process and executed. These calls are interleaved with the create process calls.
We continue creating processes until we have six in total.
The executing process (and session) is torn down when the execution has finished.

NOTE: While the code executes we may see six RTerm processes.

Summary

So, what does all this come to then? We have Figure 2 which originates from the internals II post, where we looked into what is going on in the launchpad service. In this post we have seen the log files in Code Snippet 3, 4, 5 and then we have the WinDbg output in Code Snippet 7. In Figure 12 I try to summarize what we have discussed in this post:

Figure 12:Summary

In Figure 12 we see:

How there are backing folders in C:\<path_to_sql_server_instance>\MSSQL\ExtensibilityData for the group of user accounts that are mapped to SQL Server users when executing code via sp_execute_external_script. These backing folders will be storage for files etc., when executing.
- The files are not placed directly in the account folder but in sub-folders created when the processes are created (see below).
When executing we, as in Code Snippet 7, call launchpad!CLaunchContext::LaunchServTask, and create a satellite session and a session object.
We then go on and start creating working directories (the sub-folders mentioned above), and processes. In the figure the working directories are named WorkingDir1 etc., whereas in reality the names are Guid values. The processes are RTerm processes and are assigned the same Guid as the working directory name. So a RTerm process always executes "under" one user account sub-folder.
After two initial directories and processes have been created, one of the two processes are assigned to the created session.
When the process has been assigned to the session, the script is executed via; launchpad!CSQLSatelliteCommunication::SendResumeWithLoginInfo and launchpad!CSQLSatelliteConnection::WriteMessage. This normally happens while the third process is created.
Processes and working directories 4 - 6 are created. These processes are available for subsequent executions.
When the execution has finished the session is torn down together with the process.
- At this stage we now have 5 processes running, and 5 working folders.

As was the case in the internals II post, some parts of this post is educated guesswork. If anyone have more information I wild be more than happy to correct any inaccuracies. In either case, I hope you have enjoyed this journey in the "bowels" of the launchpad service.

If you have comments, questions etc., please comment on this post or ping me.

↧

Interesting Stuff - Week 15

April 16, 2017, 11:53 am

≫ Next: SQL Server 2017 - Python Executing Inside SQL Server

≪ Previous: Microsoft SQL Server R Services - Internals III

Throughout the week, I read a lot of blog-posts, articles, etc., that has to do with things that interest me:

data science
data in general
distributed computing
SQL Server
transactions (both db as well as non db)
and other "stuff"

This is the "roundup" of the posts that has been most interesting to me, this week.

This week I do not have that much material, but there are still some interesting "stuff".

Distributed Computing

Our Concurrent Past; Our Distributed Future. In the roundup for week 11, I had a link to a summary of Joe Duffy's keynote at QCon in London. This is the actual presentation, video and slides.

.NET

The CLR Thread Pool 'Thread Injection' Algorithm. Ah, deep, "down in the plumbing" of .NET stuff. Matt Warren discusses the Hill Climbing algorithm, and how it is used to control the rate at which threads are added to the CLR thread pool. Oh and if you are interested in the plumbing of .NET/CLT Matt's blog is a treasure trove, full of information.
Azure Service Fabric SDK Becomes Open Source. An InfoQ article about, how, Microsoft open sourcing parts of its Service Fabric SDK.

Microsoft Azure

Azure Functions now has direct integration with Application Insights. A blog-post from the Microsoft App Service Team how Azure Functions now support integration with Azure Application Insights. This is cool on so very many levels, and you guys should really check it out!

Streaming

Continuous Queries on Dynamic Tables: Analyzing Data Streams with SQL. It seems that each weeks "roundup", has something about Apache Flink, and it's dynamic tables and SQL support. This is yet another blog-post about it. It is really cool "stuff"!

SQL Server

Saving input and output with sp_execute_external_script. Tomaz is, as myself, playing around with SQL Server R Services, and in this blog-post he tries to figure out how to store and save the R code being executed with sp_execute_external_script. It is not as straightforward as it may seem!
Microsoft SQL Server R Services - Internals III. I finally managed to finish the third part of the Microsoft SQL Server R Services - Internals "saga". This "episode" looks at how the launchpad service creates various processes.

Data Science

New features in the checkpoint package, version 0.4.0. Revolution Analytics blogs about the new version of their checkpoint package. The package is designed to make it easy to write reproducible R code by allowing you to go backward (or forward) in time to retrieve the exact versions of the packages you need.
The Team Data Science Process. Buck Woody writes about the Microsoft Team Data Science Process.

That's all for this week. I hope you enjoy what I did put together. If you have ideas for what to cover, please comment on this post or ping me.

↧

SQL Server 2017 - Python Executing Inside SQL Server

April 20, 2017, 11:49 am

≫ Next: Microsoft SQL Server R Services - Internals IV

≪ Previous: Interesting Stuff - Week 15

On April 19, 2017 Microsoft held an on-line conference Microsoft Data Amp to showcase how Microsoft’s latest innovations put data, analytics and artificial intelligence at the heart of business transformation. The keynotes speakers were Scott "Red Shirt" Guthrie who is an Microsoft Executive Vice President (fairly high on the food-chain), and Joseph Sirosh a Corporate Vice President, head of the Information Management and Machine Learning group. Joseph probably knows a thing or two about data, and Scott - well, Scott knows A LOT!

During the keynotes, Scott and Joseph shared how Microsoft’s latest innovations put data, analytics and artificial intelligence at the heart of business transformation. A big enabler for this is SQL Server 2017 (the "artist" formerly known as SQL Server V.Next), which introduces a lot of very cool new "stuff".

The keynote speeches were followed by recorded short:ish sessions drilling down into certain aspects of Microsoft's new offerings. If you are interested in the various presentations at Microsoft Data Amp, they all are on Channel 9.

What caught my eye was the announcement of how SQL Server 2017 now supports Python as an extensible script engine. When Microsoft introduced support for R in SQL Server 2016, rumors immediately surfaced how other engines also would be supported, where Python was high on the list (Julia another one).

Seeing that I am somewhat into SQL Server R Services (the SQL Server 2016 flavor) here, here, here, here, and more posts to come, I really had to have a look.

The rest of this post is a brief introduction to executing Python code in SQL Server 2017, in essence I just want to be able to execute some sort of Python code.

Installation

The installation steps of external scripts engines doe not differ much (at all) from the installation in SQL Server 2016, apart from the fact you now can install Python. So if you feel you need more information abut the install process, please go and read my Microsoft SQL Server 2016 R Services Installation post.

I downloaded the SQL Server 2017 CTP 2.0, and after battling trying to register (I assume a couple of other people had the same idea as myself), I managed to get an iso down to my machine and installed it as an instance. During Feature Selection at installation time we can see that the promise of Python was not a lie. It is part of Machine Learning Services, as in Figure 1:

Figure 1:Python Option In Database

Looking further we also see how Python can be installed as a stand-alone engine:

Figure 2:Python Option In Database

The stand-alone installation option is useful, for example,if a data scientist want to run Python (or R) on his or her own machine.

NOTE: You may wonder why, in Figure 2, the check boxes for R as well as Python is checked, but "dimmed" out? This is because, on this particular machine, I have already installed the standalone versions once.

OK, so after successful installation, and if you have installed the SQL Server engine, R services and Python services you should see something like so in the C:\path_to_instance folder:

Figure 3:Finished Install

Now, let's test this "shiny new thing out". Oh, in this post I leave out all "grunge" about the launchpad service, etc. If you are interested go ahead and read about it in the previously mentioned post.

Execute Python Code in SQL Server 2017

After we have done the installation, but before we can execute the Python (or R) code we need to enable the execution of external scripts. As I covered in the post Microsoft SQL Server 2016 R Services Installation you enable external scripts by changing the configuration as in Code Snippet 2AND restarting the instance.

Enable External Scripts

EXECsp_configure'external scripts enabled',1
RECONFIGUREWITHOVERRIDE

Code Snippet 2:Execute sp_configure

After having enabled everything let's execute sp_execute_external_script. The code is very, very basic - more or less the least what we can get by with for Python. It does have some similarities with the code in the Microsoft SQL Server 2016 R Services Installation post, and Code Snippet 2 shows the code:

Execution of Python Script

EXECsp_execute_external_script
@language=N'Python',
@script=N'print("The Answer Is 42!!!")';
GO

Code Snippet 3:Test That Python Installation Works

In Code Snippet 3, we say that the language is Python (the @language parameter), and that the script we want to execute is a Python print statement: print("The Answer Is 42!!!"). That's all!

When executing this, something like so should be printed out in the Messages tab in SQL Server Management Studio:

Figure 4:Result from Python Execution

We have now executed our first Python code in SQL Server 2017. Just for "giggles", let us ensure that we still can execute R code:

Execution of R Script

EXECsp_execute_external_script@language=N'R',
@script=N'OutputDataSet<-InputDataSet',
@input_data_1=N'SELECT 42'
WITHRESULTSETS(([TheAnswer]intnotnull));
GO

Code Snippet 4:Test That Execution of R Still Works

That should work, and you will see something like so:

Figure 5:Result of Execution of R Script

That seemed to work OK, let's ship it!

Summary

We have in this post seen how SQL Server 2017 introduces support for execution if Python in addition to R scripts. There is obviously a lot more to the above than what I have covered in the above, and I will definitely come back to both Python as well as R in future posts.

If you have comments, questions etc., please comment on this post or ping me.

↧

Microsoft SQL Server R Services - Internals IV

April 23, 2017, 5:42 am

≫ Next: Interesting Stuff - Week 16

≪ Previous: SQL Server 2017 - Python Executing Inside SQL Server

This post is part of a series of blog-posts about Microsoft SQL Server R Services:

Microsoft SQL Server 2016 R Services Installation
Microsoft SQL Server R Services - Internals I
Microsoft SQL Server R Services - Internals II
Microsoft SQL Server R Services - Internals III
Microsoft SQL Server R Services - Internals IV (this post)
More to come (hopefully)

This post is the fifth post about Microsoft SQL Server R Services, and the fourth post that drills down into the internal of how it works. In Internals - III, I wrote about how the launchpad service creates multiple processes when executing an external script.

Seeing that some of the conclusions I came to was somewhat educated guesses, I asked you guys to correct me where I was in-correct and/or add more information. After that post - Bob Albright (@bob_albright) - wrote me an email and pointed me to some resources around process creation, as well as some demo code. Thanks Bob!

So today we'll drill even further into the creation of processes, and see how they are used.

Recap

In Internals - III, we talked about how, during installation of an R enabled SQL Server instance, 20 Windows accounts are created. These accounts are created for the purpose to be able to provide isolation between users when executing external scripts.

In addition to the Windows user accounts created during installation, folders named as the individual Windows accounts are also created in the c:\<sql_instance_path>\MSSQL\ExtensibilityData folder. These folders act as storage for files, results, objects, etc., during execution of an external script.

When a user executes an external script in SQL Server, that account is being mapped to one of the 20 Windows account created, and it is under that Windows account the external part of the script is executed. Subsequently the files, etc., mentioned above, ends up in that folder somewhere. I write somewhere, because it is not entirely correct to say that the files, results, etc., are stored directly in the user folder. They are in fact stored in sub-folders of the user folder.

During execution the launchpad service creates working directories (the sub-folders above) and processes, and assigns the working directories and processes the same names (Guid values).

Figure 1 below shows the flow when executing a script:

Figure 1:Flow when Executing a Script

As per the figure:

We see backing folders in c:\<sql_instance_path>\MSSQL\ExtensibilityData for the group of user accounts that are mapped to SQL Server users when executing code via sp_execute_external_script.
When executing, launchpad!CLaunchContext::LaunchServTask is called, and a satellite session and a session object is created.
Then the working directories (the sub-folders mentioned above), and RTerm processes are created. In the figure the working directories are named WorkingDir1 etc., whereas in reality the names are Guid values. The processes are RTerm processes and are assigned the same Guid as the working directory name. So a RTerm process always executes "under" one user account sub-folder.
After two initial directories and processes have been created, one of the two processes are assigned to the created session.
When the process has been assigned to the session, the script is executed via; launchpad!CSQLSatelliteCommunication::SendResumeWithLoginInfo and launchpad!CSQLSatelliteConnection::WriteMessage. This normally happens while the third process is created.
Processes and working directories 4 - 6 are created. These processes are available for subsequent executions.
When the execution has finished the session is torn down together with the process.
- At this stage we now have 5 processes running, and 5 working folders.

The above is in essence what Internals - III covered, and if you want all the "nitty-gritty", please read that post.

Processes

In the Internals - III, we figured out that, by default, the launchpad service creates 5 processes, plus the process that is used for execution, when executing an external script. In the post I assumed that the reason for creating 5 (well 6 actually), was for performance, and I also wondered where that magic number 5 came from - seeing I couldn't find it in any config files. That's where Bob's email comes in, as he pointed me to a blog-post by the SQL Server engineering team, a.k.a TIGER (cool name!).

So, that particular blog-post mentions that the number of processes spun up can be controlled by a setting in the rlauncher.config file: PROCESS_POOL_SQLSATELLITE_GROWTH. If not set, it defaults to 5, and in the end, when executing, the setting + 1 processes has been created as per above.

The post also "kind of" confirms that my assumption about performance being a reason for spinning up multiple processes is correct, considering that a user may execute concurrent requests and it takes around 100 ms to spin up a process. In Internals - III I mentioned how the processes that are created are added to a pool of processes. So, the assumption on my part is that when there are multiple processes available, a new request will not execute on a newly created process, but will use a process from the pool.

Let us see if we can confirm the points about the config setting as well as performance.

Controlling Number of Processes

Initially we'll begin with looking into the PROCESS_POOL_SQLSATELLITE_GROWTH setting and see if it has any effect on the number of processes being created. In Internals - III we looked at the number of processes having been spun up while the code was executing and we saw something like so:

Figure 2:RTerm Processes

So 6 processes alive while the code is executing. After the code has finished, the executing process is torn down, and we have 5 processes in the pool. That was without having changed any settings, so let's change the settings:

Stop the launchpad service.
Open the rlauncher.config file with your text editor of choice (you need to run the editor as administrator).

The config file looks something like what you see in Figure 3:

Figure 3:RLauncher Configuration

As you see, there is no PROCESS_POOL_SQLSATELLITE_GROWTH setting. Let us add the setting with a value of 15: PROCESS_POOL_SQLSATELLITE_GROWTH=15 and see what happens.

Save the config file after you have added the setting as per above.
Restart the launchpad service.

Well, it looks like the launchpad service started, so the setting is not causing any issues (yet). We'll now execute some code and try and figure out if more processes will be created. We use the same code as we did in Internals - III, where the code has a pause statement, so we can easier look at what is happening:

Script with Pause

EXECsp_execute_external_script
@language=N'R',
@script=N'OutputDataSet<-InputDataSet;
                              Sys.sleep(120);',
@input_data_1=N'SELECT 42'
WITHRESULTSETS(([TheAnswer]intnotnull));

Code Snippet 1:Execute with Sys.sleep

As in Internals - III, I use Process Explorer from Sysinternals. So, let's go ahead and see what happens:

Start Process Explorer, order by Process, and scroll down to where you see process names starting with "RT" (on my box there are none at this stage), or where those processes should be.
- If you at this stage see RTerm, restart the launchpad service again and kill those processes.
Execute the code in Code Snippet 1.

While the code is running, take a quick look in Process Explorer, and you should see something like so:

Figure 4:RTerm Processes after Setting Change

In Figure 4 you can now see 16 RTerm.exe processes running. Once again, the reason for 16 instead of 15 is that the launchpad service spins up the number it is supposed to, plus one more. After the execution has finished, you will see 15 RTerm processes.

So yes, the setting does have impact. If you want, you can now delete the setting from the config file and restart the launchpad service.

Process Pool Impact on Executions from the Same Session / Concurrent Executions

Above I mentioned that I thought that by spinning up these processes, we'll get a performance benefit when executing concurrently or if we are, under the same SPID, doing subsequent executions. After all, as mentioned above, the processes are added to a process pool, and they should then be available for usage. A bit like connection pooling in ADO.NET or thread pooling in the .CLR.

Same Session Multiple Execs

Let us start with looking what happens when doing multiple executions in the same SQL Server session (SPID).

So the way we will do this is to look at the process id of the RTerm processes, and the process id of the executing code. The process id's of the RTerm process we get from Process Explorer, and in Figure 5 below you see the process id's in the outlined column furthest to the right:

Figure 5:RTerm ProcessId's

Figure 5 tells us how we can see the id's of the RTerm processes, but how can we see the process id under which the code executes? It's not like we have @@SPID in the external engine. Fortunately R has a function to get the process id of the process in which the code is executing: Sys.getpid(). So if we change the code to something like in Code Snippet 2, we should be able to see the process id, and then be able to compare what we see from the RTerm processes:

Process ID

EXECsp_execute_external_script
@language=N'R',
@script=N'
                    pid <- Sys.getpid()
                    data<-InputDataSet
                    data$pid <- pid
                    OutputDataSet<-data;
                    Sys.sleep(120);',
@input_data_1=N'SELECT 42'
WITHRESULTSETS(([TheAnswer]intnotnull,ProcessIDint));

Code Snippet 2:Get the Process Id

Notice how we create and add a new column, pid, to the R data-frame, data, by: data$pid (the names pid and datacould be anything). Now, the way we will do this is to, in the same session:

Execute the code
Capture the process id's of the created RTerm processes, before the code has finished executing.
Look at the result from executing the code in Code Snippet 2 and compare the process id which is part of the result with the process id's we captured from the RTerm processes.

When we have done the steps above we repeat it a second time. If my assumption is correct that during a subsequent execution, a process will be used from the pool that was created at first execution; then the process id that comes back from the result of the second execution, should be found in the process id's that were captured during the first execution.

NOTE: It is important that the second run of the code is done immediately after the first. If not, some of the pooled processes may have been torn down.

Let's do this:

If you haven't deleted the PROCESS_POOL_SQLSATELLITE_GROWTH setting from the config file, go ahead and do that.
Restart the launchpad service.
Navigate to the Launchpad.exe process in Process Explorer.
Execute the code in Code Snippet 2.
While the code is executing capture the process id's from the RTerm.exe processes.

The capture of the first execution is shown in Figure 6 below:

Figure 6:Process Id's from First Run

The result from the code came back with a process id of 16956. If you were to look at the processes directly after the result came back, you would see 5 processes, as process 16956 (the executing process) has been torn down. Now execute the code a second time. The captured processes are now like so:

Figure 7:Process Id's from Second Run

In Figure 7 we indeed see that process 16956 is not there any more. When the result comes back the process id is 19028 as in Figure 8:

Figure 8:Result from 2:nd Run

So, looking back at Figure 6, we see how process id 19028 is part of the processes initially created, so it seems that the assumptions about how processes are used are correct.

But wait a second when we look at Figure 8, we see a new process id - 19192, and if we were to look at the processes right after the code has finished running, it would look something like so:

Figure 9:After 2:nd Run

In Figure 9 we see that the process that we executed under in the second run is gone as expected, but we have a new process running - 19192. So what happens is that, in parallel with R executing the code, the launchpad service is spinning up a new process.

The theory that for executions by the same user and SPID - the launchpad service uses processes from the pool seems to be correct.

Concurrent Executions Different Sessions

To see what happens for concurrent executions, by the same user but from different SPID's, we'll do it in almost the same way as above. Start with restarting the launchpad service, so we don't have any "hangers on-ers" from previous runs. We copy the code in Code Snippet 2 to a new query window in SQL Server Management Studio, (this ensures a new SPID), and then we:

Execute the code in query window 1.
Capture the RTerm process id's.
Execute the code in query window 2, while the code in query window 1 still executes.
Capture the RTerm process id's.

After both queries have finished executing, you will see that executing concurrently from the same user but different sessions will behave the same was as executing multiple times from the same session:

A process will be picked up from the pool and the code will execute under that session.
The launchpad service creates a new process, and adds it to the pool.
When the code has finished executing, the process which it executed under is being torn down,

So the theory holds true here as well.

Concurrent Executions Different Sessions Different Users

So what happens then if there are multiple users executing code concurrently? In this scenario nothing is different from when a single user executes for the first time:

The second user will be mapped to another user account,
The launchpad service creates it's normal 5 processes, plus one.
The code is executed.

Figure 10:Two Users Executing Concurrently

Figure 10 shows what it looks like in Process Explorer when two different users executes concurrently. I either of these users would then execute another statement, it would be exactly as above where we looked at the single user scenario.

NOTE: When looking in Process Explorer at the RTerm processes you can actually see what process is active. The active process has a value in the CPU column.

Summary

In this blog-post I set out to prove/disprove two things:

That the setting PROCESS_POOL_SQLSATELLITE_GROWTH can be used to control the number of processes being created by the launchpad service.
Processes that are added to the pool is being picked up and used for subsequent executions for a user.

What we saw was:

When PROCESS_POOL_SQLSATELLITE_GROWTH is absent from the rlauncher.config file, the launchpad service creates 5 processes plus 1 by default, and after execution the executing process is torn down. The others are added to the pool.
When a value has been set for PROCESS_POOL_SQLSATELLITE_GROWTH, the launchpad service creates that number of processes plus one, and after execution the executing process is torn down. The others are added to the pool.
When a user executes subsequent requests, or concurrent requests from different sessions, processes are picked up and used from the pool.
- The launchpad service simultaneously creates a new process.

So, thanks Bob for sending me the mail with the link the the post. That made me look deeper into how this "stuff" works! In that email Bob also sent some code, which will be used as a topic for another internals blog-post. That post will be about parallelism and the RTerm processes.

If you have comments, questions etc., please comment on this post or ping me.

↧

Interesting Stuff - Week 16

April 23, 2017, 11:22 am

≫ Next: Interesting Stuff - Week 17

≪ Previous: Microsoft SQL Server R Services - Internals IV

Throughout the week, I read a lot of blog-posts, articles, etc., that has to do with things that interest me

data science
data in general
distributed computing
SQL Server
transactions (both db as well as non db)
and other "stuff"

This is the "roundup" of the posts that has been most interesting to me, this week.

Wow! Wow is all I can say! Sure, of course I knew that Microsoft Data Amp would take place this week, but I had no idea that there would be so much interesting stuff coming out of it!! So instead of pointing to each and every really cool announcement (RCA), I'll try to keep it contained somewhat, and point to the Channel 9 site for all the videos, plus a couple of the Really, Really Cool Announcements (RRCA).

Microsoft Data Amp

This section will be SQL Server heavy, but still a x-section of various interesting things from Microsoft Data Amp.

Microsoft Data Amp 2017. The central site for all videos from Microsoft Data Amp. If you only can watch two videos, please, please do yourself a favor and watch the keynotes: Scott "Red-Shirt" Guthrie, and Joseph Siro.
SQL Server 2017: Building applications using graph data. SQL Server 2017 (the "artist" formerly known as SQL v.Next) has support for graph data. Shreya Verma presents about what you can do with it.
SQL Server 2017: Advanced Analytics with Python. Oh, ah - SQL Server 2017 supports Python in addition to R for analysis tasks!! See more below.
What's new in R Server 9.1 and SQL Server R Services. So let's not let Python run away with all the "glory" in SQL Server 2017. The release of R Server 9.1 in SQL Server R Services does have some cool stuff up its sleeve!
SQL Server 2017: Adaptive Query Processing. Let's not forget my roots; T-SQL, OLTP, the "black art" of query processing. Microsoft in SQL Server 2017 introduces a new family of adaptive query processing improvements that will enhance the performance of workloads that have historically been difficult to tune through classic methodologies. How cool is that?!!!

OK, so let us somewhat return to the "normal" program.

.NET

C# Futures: Nullable Reference Types. Well, well, well - it looks like Microsoft is moving towards an F# / Haskell model where types, reference or value, are non-nullable by default. It will be very interesting to see how the community reacts to this.

SQL Server

Admittedly , there was a lot of SQL Server new (almost all?) coming from Microsoft Data Amp, but there is still some "other" noteworthy SQL Server topics.

Saving input and output with sp_execute_external_script using temporal table and file table (part #2). In last week's roundup, I mentioned how Tomaz tries to figure out how to see what statements have been executed from sp_execute_external_script. This is part two of the "saga"
SQL Agent and the hairiest Dateadd in town. Another "epic" drill down in the bowels of SQL Server by Ewald, this time trying to find about weird DATEADD behavior.
SQL Server Mysteries: The Case of the Not 100% RESTORE…. Awesome "spelunking" by Bob Ward, where he tries to figure out why RESTORE doesn't work as expected!
SQL Server 2017 Community Technology Preview 2.0 now available. READ ALL ABOUT IT, and then go and download the bits. SQL Server 2017 CTP 2.0 is available for download.

Data Science

Data Preparation Pipelines: Strategy, Options and Tools. Article from InfoQ about how data preparation is an important aspect of data processing and analytics use cases, and how business analysts and data scientists spend about 80% of their time gathering and preparing the data rather than analyzing it or developing machine learning models.
Microsoft R Server 9.1 now available. My friends over at Revolution Analytics, also watched (or attended) Microsoft Data Amp and blogs about how Microsoft R Server 9.1 is now available, and some of the new interesting features in it.
Deep Learning on the New Ubuntu-Based Data Science Virtual Machine for Linux. Microsoft’s Data Science Virtual Machine (DSVM) is a family of popular VM images published on the Azure marketplace with a broad choice of machine learning and data science tools. Microsoft is extending it with the introduction of a brand-new offering in this family – the Data Science Virtual Machine for Linux, based on Ubuntu 16.04LTS – that also includes a comprehensive set of popular deep learning frameworks.
Deep Learning with Caffe2 on the Azure Data Science Virtual Machine. Caffe has been one such early and popular open source deep learning framework. It has now been rewritten and greatly improved in the latest iteration, named Caffe2. Microsoft and Facebook have worked together to bring Caffe2 to Azure on the Data Science Virtual Machine (DSVM) which can run on either GPU or CPU based virtual machines on the cloud.
Delivering AI with data: the next generation of Microsoft’s data platform. An overview of Microsoft's offerings in he data and AI space.

Shameless Self Promotion

Some "plugs" about a couple of recent blog-posts by yours truly.

SQL Server 2017 - Python Executing Inside SQL Server. Straight after the Microsoft Data Amp event I downloaded SQL Server 2017 and started ~~playing around~~, (erm, I mean) researching. This post is a "Hello World" Python running in SQL Server 2017. There will be more posts coming about Python in SQL Server 2017.
Microsoft SQL Server R Services - Internals IV. The fourth post in the Microsoft SQL Server R Services - Internals "saga". In this episode, the fearless hero (me) looks more into process creation, process pools and other cool stuff!

That's all for this week. I hope you enjoy what I did put together. If you have ideas for what to cover, please comment on this post or ping me.

↧

Interesting Stuff - Week 17

April 30, 2017, 11:24 am

≫ Next: Microsoft SQL Server R Services - Internals V

≪ Previous: Interesting Stuff - Week 16

Throughout the week, I read a lot of blog-posts, articles, etc., that has to do with things that interest me

data science
data in general
distributed computing
SQL Server
transactions (both db as well as non db)
and other "stuff"

This is the "roundup" of the posts that has been most interesting to me, this week.

Streaming

Nikita Ivanov on Apache Ignite In-Memory Computing Platform. You can hardly turn around without "bumping" into a platform offering in-memory computing. Apcahe Ignite is a newcomer to the mix, and - in an InfoQ interview - Nikita Ivanov talks about what Apache Ignite is. To me it is interesting as it supports both both key-value persistence as well as streaming and complex-event processing.

SQL Server

How to find query plan choice regressions with SQL Server 2017 CTP2. A blog post by Jovan Popovic from Microsoft about how SQL Server 2017 introduces functionality to allow you to easily identify performance regressions in SQL queries. I know some DBA's at Derivco who'd sell their first-born for this.
SQL Server community-driven enhancements in SQL Server 2017. A post by the SQL Server engineering a.k.a TIGER team, how a lot of new functionality in SQL Server 2017 has been introduced due to ideas/requests from the community. Very cool!!
How are default column values stored?. Paul from SQLskills "spelunks" about how default column values are stored.

Data Science

Data Preparation for Data Science: A Field Guide. An InfoQ presentation about a utility written with Apache Spark to automate data preparation, discovering missing values, values with skewed distributions and discovering likely errors within data. This could come in very handy for us.
Using Microsoft’s Deep Learning Toolkit with Spark on Azure HDInsight Clusters. How to do distributed deep learning over big datasets on Azure HSInsight Spark with Microsoft Cognitive Toolkit. This is very, very interesting!!
R 3.4.0 now available. The guys at Revolution Analytics points out that R 3.4.0 is available, and some of the new functionality in the release. Go and get it before it is sold out!
Bringing IoT to sports analytics. the morning paper is back after vacation! This is about sports analytics and how IoT devices can help analyzing various things, and potentially replacing very, very expensive high-quality cameras.
Leveraging Microsoft R and in database analytics of SQL Server with R Services through Alteryx Designer. In the roundup for week 12 I wrote about how Revolution Analytics mentioned this visual designer for R supporting SQL Server R Services: Alteryx. In the post I link to in this roundup, the Microsoft R Product Team tries out the designer against SQL Server R Services. It looks quite a lot like Azure ML. I so need to try it out!
Microsoft Puts AI Where the Data Is. A very nice article about how Microsoft tries to pu the Data Science / AI where the data is, in the database.
Performance differences between RevoScaleR, ColumnStore Table and In-Memory OLTP Table. A comparison, by Tomaz, of performance between various data stores and applying data science:ish functions against the data.
Does Data Science Replace BI?. Buck Woody asks the question whether BI is being replaced by Data Science.

SQL Server R Services

Just an update about where I am with my series about SQL Server R Services. I am busy working on Internals - V, and I had hoped to have it out by now, but there are some things I still want to investigate further. I hope I will be able to publish it early this coming week. In the meantime you can always go back and read the previous posts :):

That's all for this week. I hope you enjoy what I did put together. If you have ideas for what to cover, please comment on this post or ping me.

↧