A common way of creating data files is via queries to a database. This article explains the key steps in doing this, when to use database queries, and also why alternative approaches are often preferable.
Key steps for importing via SQL queries
Importing data via a SQL query involves three key steps:
- Obtain access to the database
- Create the query
-
Extract into a format that your data analysis software can read
Obtain access to the database
Accessing a database requires that:
- The database is a type that can be queried by the analysis software. Typically this will mean it is a database that allows the user to extract data via SQL queries.
- There must be a mechanism to connect to the database. For example, if using cloud-based analysis software, this requires a public (over the internet) connection to the database server. This may require reconfiguration of firewalls by database administrators.
- You know
- The type of database (brand, e.g. Microsoft SQL Server)
- Server name
- Database name
- User name
- Password
Create a query
Relational databases store data in tables. Where the table is appropriate for data analysis (i.e., has appropriate cases and variables), data files can be created with relatively straightforward queries. Most often, such queries are written in SQL. A simple SQL query for extracting a data file from a table called Users is:
select UserID, Name, Email, Created, QuantityPurchased
from Users
WHERE UserId <> ''
order by Created
Often, however, tables in relational databases are created with data storage and management considerations in mind, rather than data analysis. Consequently, the SQL queries used to extract data tend to be quite complex and require the person writing the query to:
- Have detailed knowledge of the database schema (the description of what information is stored where and how it interrelates).
- Have sufficient skills to create a query that will generate a flat data file, by joining multiple tables, aggregating data, sorting data, and deleting data. This generally requires programming experience (it's not something that a novice can hack together).
select
-- Start with the main IDs.
c.CompanyID, i.InvoiceID, c.BillCountry,c.BillCompanyName, c.Name, c.Notes, c.AdminNotes, c.BeforeRenewal, c.NumberOfProjects,
-- Calculate the "organization name" in the same manner which is used for
-- outstanding invoice reminders to staff.
isnull(c.BillCompanyName, c.Name) + (case
when c.BillCompanyName is not null and c.Name is not null
and c.BillCompanyName <> c.Name
then ' (company name: ' + c.Name + ')'
else ''
end) as Organization,
-- These fields come straight from the Invoices table.
i.Issued, i.Due, i.Currency, i.Total, i.Tax, i.Cancelled, i.OfficeId,
-- "CreatedBy" is only populated if the invoice was created by a site
-- administrator who was logged in as such at the time.
isnull(i.CreatedBy, '') as CreatedBy,
-- Turn the InvoiceLines for each Invoice into a comma-separated list
-- which includes the quantity (fancy!)
stuff((
select ', ' + Description + ' (x' + convert(varchar(23), Quantity) + ')'
from InvoiceLines
where InvoiceID = i.InvoiceID
order by Description
for xml path('')
), 1, 2, '') as Lines,
-- The earliest start date of an annual licence
(
select min(Start)
from InvoiceLines
where InvoiceID = i.InvoiceID
and Finish is not null
) as StartDate,
-- The latest end date of an annual licence
(
select max(Finish)
from InvoiceLines
where InvoiceID = i.InvoiceID
and Finish is not null
) as EndDate,
-- The number of Professional licences (ignoring server time licences)
(
select sum(Quantity)
from InvoiceLines
where InvoiceID = i.InvoiceID
and Finish is not null and AllowEditingForTicks > 0 and Total > 0
) as NumberProfessionalLicenses,
-- Total payments against this invoice, or zero if there are none.
(
select sum(Total)
from InvoiceLines
where InvoiceID = i.InvoiceID
and Finish is not null
) as TotalAnnualLicenceValue,
-- Total payments against this invoice, or zero if there are none.
isnull((
select sum(Amount)
from Payments
where InvoiceID = i.InvoiceID
), 0) as TotalPayments,
-- Across the Company, work out what percentage of available edit-mode
-- time has been used, as a percentage.
isnull(convert(float, (
-- Used edit mode "ticks" - there are 10 000 000 ticks per second.
select sum(s.EditedForTicks)
from Licences s
join InvoiceLines n2 on s.InvoiceLineID = n2.InvoiceLineID
join Invoices i2 on n2.InvoiceID = i2.InvoiceID
where i2.CompanyID = i.CompanyID
-- As Licences can be cancelled separately to Invoices, we need
-- to check the "Cancelled" field here too.
and s.Cancelled is null
)), 0.0) / isnull(nullif(convert(float, (
-- Available edit mode "ticks" - there are 10 000 000 ticks per second.
select sum(AllowEditingForTicks)
from InvoiceLines n3
join Invoices i3 on n3.InvoiceID = i3.InvoiceID
where i3.CompanyID = i.CompanyID
)), 0.0), 1.0) * 100.0 as CompanyEditUsedPC
from Invoices i
join Companies c on i.CompanyID = c.CompanyID
-- Don't include cancelled Invoices.
where i.Cancelled is null
order by Issued
Extract into a format that your data analysis software can read
SQL is not sufficiently rich to allow a user to create a metadata-rich file format (e.g., you can't create a QPack or SPSS Data File via a SQL query). Instead, it's typical to extract the data in a flat file format. Usually, a CSV file is best. In some software, this is the default (e.g., Q, Displayr). In other software, this is an option (E.g., SQL Server).
When to use database queries
Using database queries to obtain data is a sensible approach when either:
- There is no other way.
- The data contains limited metadata and needs to be regularly updated.
The problems with extracting survey data via database queries
When people conceive of importing data directly from a database, they often are expecting to have the type of experience that occurs with an integration. However, as discussed in the previous section, CSV files are the norm, but these files do not have good metadata.
An additional problem with database queries is that when the survey changes for some reason (e.g., the addition of a new brand to a question), this may require that the database query is rewritten and may make it hard to reconcile the data across multiple versions of the database query.
Comments
0 comments
Please sign in to leave a comment.