An easy way to make SQL queries from Python to the SDSS database¶

Author: Eduardo Martín Calleja

In this entry we will see a very simple method of executing SQL queries on the Sloan Digital Sky Survey (SDSS) database. In this way we can get a lot of data about all kinds of celestial objects and load them into Python data structures, like Pandas dataframes, for later process or plotting.

This post has been written entirely using the IPython Notebook. I will also use the Python module "mechanize" to surf the web and run interactively HTML forms. I will explain in detail each of the steps, and at the end I will summarize to avoid being lost in the details and show the usability of the proposed method.

Imports and references¶

For an introduction to the execution of SQL queries on the SDSS database I can not think of a better resource than their own tutorial, together with its complement of examples: Sample SQL Queries
To view the tables and views that exist in the database, and what information is available in each of them, you can use: Schema Browser
And to run interactively a SQL on the SDSS database, you can access the page: SQL Search
The mechanize home page
A nice notebook using the method in this post to query the SDSS database: here

In [1]:

%matplotlib inline

from __future__ import division

import numpy as np
import pandas as pd
import mechanize
from StringIO import StringIO # To read a string like a file

# This IPython magic generates a table with version information
#https://github.com/jrjohansson/version_information
%load_ext version_information
%version_information numpy, pandas, StringIO

Out[1]:

Software	Version
Python	2.7.9 64bit [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
IPython	2.3.1
OS	Linux 3.13.0 45 generic x86_64 with debian jessie sid
numpy	1.9.1
pandas	0.15.2
StringIO	StringIO
Sat Feb 21 12:08:20 2015 CET

In [2]:

# URL to the SDSS SQL Search DR10
url = "http://skyserver.sdss3.org/dr10/en/tools/search/sql.aspx"

SQL preparation¶

Before we begin to interact from Python with the web page of the SDSS that allows us to send a SQL statement to the database, I will prepare a test SQL query on a string variable. In this SQL we will find 10 objects (this is a test!) of type 6 = 'STAR', with clean photometry and blue color in a certain region of sky. But, be careful to avoid any comments (those preceded by a double dash -) when you create the string.

In [3]:

s = 'SELECT TOP 10                         \
    objID, ra, dec, modelMag_u, modelMag_g \
FROM                                       \
    PhotoPrimary                           \
WHERE                                      \
    ra BETWEEN 140 and 141                 \
AND dec BETWEEN 20 and 21                  \
AND type = 6                               \
AND clean = 1                              \
AND modelMag_u - modelMag_g < 0.5'

Web surfing with the Python mechanize module¶

The first step in using mechanize will be to create a Browser-like object to be able to navigate using its methods

In [4]:

br = mechanize.Browser()

Then we must open a session using the url defined above, pointing to the SDSS web page that allows us to make the SQL queries:

In [5]:

resp = br.open(url)

In [6]:

resp.info()

Out[6]:

<httplib.HTTPMessage instance at 0x7f4ba6cdd950>

When you want to interact with a web page, you will be interested to know the HTML forms contained in it. An HTML form is a section of the document between the tags: FORM and /FORM

A HTML form contains a series of special objects called controls such as checkboxes, radio buttons, menus, etc. and labels of these objects. The user interacts with the page by modifying a control, for example by selecting an option, introducing a text in a field, etc. and sending this modified form back to the server.

Each HTML form on the page has a name, although this can in some cases be empty. To get a list of the names of the forms in the page we can write:

In [7]:

for f in br.forms():
    print f.name

sql

That is, in this case there is only one form on the page, named "sql"

At the same time, each form has a list of controls that also have a name, which can also be left blank. To list the forms on the page, along with their controls, and each control type, we can do the following:

In [8]:

for f in br.forms():
    print f.name
    for c in f.controls:
        print '\t',c.name, '\t', c.type

sql
 clear  button
 cmd  textarea
 None  submit
 syntax  checkbox
 format  radio
 reset  reset

We will focus on the "cmd" control which is the text area in which we write our SQL, and the "format" control, which, as you can see on the web page, is used to control the type of output desired: HTML, XML, CSV, etc.. To access these controls you must previously select the form to which they belong:

In [9]:

br.select_form(name="sql")

Then we will modify the control 'cmd' to enter our SQL, and the 'format' control, to select the output in csv format.

In [10]:

br['cmd'] = s  # This is the string with the sql query
br['format']=['csv'] # data output format
response = br.submit()

We can get a string with the contents of the answer using the get_data() method:

In [11]:

print response.get_data()

#Table1
objID,ra,dec,modelMag_u,modelMag_g
1237667293189833237,140.000264887516,20.3528168302492,23.27699,23.07312
1237667293189833455,140.001621936922,20.4298007848266,23.79515,24.28631
1237667293189832757,140.003651871714,20.2546305174986,24.7036,24.82169
1237667430093553674,140.004909940259,20.1719995965921,24.04017,24.81654
1237667430093488938,140.009035667549,20.0174593142685,23.4923,23.47844
1237667430093488288,140.010686497758,20.0781480229235,19.89794,19.64356
1237667209974448712,140.011074894392,20.9763084088396,22.70029,22.85815
1237667430093554008,140.012379856521,20.1469377004989,23.12733,22.86493
1237667209974448933,140.013140656439,20.9086701836405,23.96772,24.16748
1237667430093488949,140.013238349062,20.0312906376057,24.73975,25.34514

But attention!, The submit() method closes the session, so, to send another SQL query you must first repeat the br.open() and br.select() calls.

Then, and in order to be able to process the data more easily, the most advisable could be to generate a Python Pandas dataframe. We can see that the first line should be discarded, while the second row contains the names of the columns, so we will keep it in the dataframe:

In [12]:

file_like = StringIO(response.get_data())
df =pd.read_csv(file_like, skiprows = 1) # skip the first row
df

Out[12]:

	objID	ra	dec	modelMag_u	modelMag_g
0	1237667293189833237	140.000265	20.352817	23.27699	23.07312
1	1237667293189833455	140.001622	20.429801	23.79515	24.28631
2	1237667293189832757	140.003652	20.254631	24.70360	24.82169
3	1237667430093553674	140.004910	20.172000	24.04017	24.81654
4	1237667430093488938	140.009036	20.017459	23.49230	23.47844
5	1237667430093488288	140.010686	20.078148	19.89794	19.64356
6	1237667209974448712	140.011075	20.976308	22.70029	22.85815
7	1237667430093554008	140.012380	20.146938	23.12733	22.86493
8	1237667209974448933	140.013141	20.908670	23.96772	24.16748
9	1237667430093488949	140.013238	20.031291	24.73975	25.34514

From here we could do such things like rename the columns giving them names more to our liking, and calculate a new column as the difference of the columns u and g, which will indicate the color of the star (more on that in another post).

In [13]:

df.columns = ['objID','ra','dec','u','g']
df['u-g'] = df['u']-df['g']
df

Out[13]:

	objID	ra	dec	u	g	u-g
0	1237667293189833237	140.000265	20.352817	23.27699	23.07312	0.20387
1	1237667293189833455	140.001622	20.429801	23.79515	24.28631	-0.49116
2	1237667293189832757	140.003652	20.254631	24.70360	24.82169	-0.11809
3	1237667430093553674	140.004910	20.172000	24.04017	24.81654	-0.77637
4	1237667430093488938	140.009036	20.017459	23.49230	23.47844	0.01386
5	1237667430093488288	140.010686	20.078148	19.89794	19.64356	0.25438
6	1237667209974448712	140.011075	20.976308	22.70029	22.85815	-0.15786
7	1237667430093554008	140.012380	20.146938	23.12733	22.86493	0.26240
8	1237667209974448933	140.013141	20.908670	23.96772	24.16748	-0.19976
9	1237667430093488949	140.013238	20.031291	24.73975	25.34514	-0.60539

Summary¶

Once we have seen the rationale of the use of the mechanize module and the creation of a Pandas dataframe, we can define a function to streamline the process for new SQL queries:

In [14]:

def SDSS_select(sql):
    '''input: string with a valid SQL query
    output: a Pandas dataframe
    '''
    br.open(url)
    br.select_form(name="sql")
    br['cmd'] = sql
    br['format']=['csv']
    response = br.submit()
    file_like = StringIO(response.get_data())
    return pd.read_csv(file_like,  skiprows=1)

The steps are as follows:

The following instructions will be executed only the first time:

In [15]:

# URL a SQL Search DR10
url = "http://skyserver.sdss3.org/dr10/en/tools/search/sql.aspx"

We prepare a string with our SQL (you can test it first on the web page, as there's no exception handling in the above function)

In [16]:

sql = 'SELECT TOP 10    \
objID, ra, dec, modelMag_u,modelMag_g,modelMag_r,modelMag_i,modelMag_z \
FROM  Star               \
WHERE ra BETWEEN 150 and 152 AND dec BETWEEN 30 and 31 AND clean = 1'

We make a call to the function, obtaining a Pandas dataframe in return

In [17]:

df = SDSS_select(sql)

And we already have our dataframe ready!

In [18]:

df

Out[18]:

	objID	ra	dec	modelMag_u	modelMag_g	modelMag_r	modelMag_i	modelMag_z
0	1237664869216289142	150.000355	30.732103	23.06350	22.79704	21.60797	21.22958	21.32821
1	1237665099003593049	150.000483	30.591703	22.71077	22.14142	21.63762	21.84505	21.44720
2	1237665098466656334	150.000711	30.105872	20.72021	18.12753	16.79675	16.08280	15.69484
3	1237665098466656330	150.000754	30.097463	22.66031	19.98154	18.61721	17.61466	17.07129
4	1237664869216289439	150.000985	30.748886	25.00283	23.08974	22.20838	21.99796	21.30983
5	1237664869216288905	150.001134	30.829336	21.41318	20.29182	19.80290	19.55658	19.49337
6	1237665129067840138	150.001698	30.483674	23.61431	22.05697	20.63076	20.13519	19.81597
7	1237665098466657196	150.002180	30.248386	25.05051	23.19651	21.70999	20.04427	19.07293
8	1237665129067840349	150.002215	30.429241	23.51003	24.00885	21.71184	21.25962	20.59406
9	1237665098466656342	150.002250	30.174147	20.84485	18.57332	17.56038	17.15493	16.93803

Here ends this post, but the access method to the SDSS database we have seen here, will be used systematically in future posts, to analyze with Python, based on real data, various properties of celestial objects.

Balbuceos astropythonicos

Etiquetas