An easy way to make SQL queries from Python to the SDSS database¶
Author: Eduardo Martín Calleja
In this entry we will see a very simple method of executing SQL queries on the Sloan Digital Sky Survey (SDSS) database. In this way we can get a lot of data about all kinds of celestial objects and load them into Python data structures, like Pandas dataframes, for later process or plotting.
This post has been written entirely using the IPython Notebook. I will also use the Python module "mechanize" to surf the web and run interactively HTML forms. I will explain in detail each of the steps, and at the end I will summarize to avoid being lost in the details and show the usability of the proposed method.
Imports and references¶
For an introduction to the execution of SQL queries on the SDSS database I can not think of a better resource than their own tutorial, together with its complement of examples: Sample SQL Queries
To view the tables and views that exist in the database, and what information is available in each of them, you can use: Schema Browser
And to run interactively a SQL on the SDSS database, you can access the page: SQL Search
A nice notebook using the method in this post to query the SDSS database: here
%matplotlib inline
from __future__ import division
import numpy as np
import pandas as pd
import mechanize
from StringIO import StringIO # To read a string like a file
# This IPython magic generates a table with version information
#https://github.com/jrjohansson/version_information
%load_ext version_information
%version_information numpy, pandas, StringIO
# URL to the SDSS SQL Search DR10
url = "http://skyserver.sdss3.org/dr10/en/tools/search/sql.aspx"
SQL preparation¶
Before we begin to interact from Python with the web page of the SDSS that allows us to send a SQL statement to the database, I will prepare a test SQL query on a string variable. In this SQL we will find 10 objects (this is a test!) of type 6 = 'STAR', with clean photometry and blue color in a certain region of sky. But, be careful to avoid any comments (those preceded by a double dash -) when you create the string.
s = 'SELECT TOP 10 \
objID, ra, dec, modelMag_u, modelMag_g \
FROM \
PhotoPrimary \
WHERE \
ra BETWEEN 140 and 141 \
AND dec BETWEEN 20 and 21 \
AND type = 6 \
AND clean = 1 \
AND modelMag_u - modelMag_g < 0.5'
Web surfing with the Python mechanize module¶
The first step in using mechanize will be to create a Browser-like object to be able to navigate using its methods
br = mechanize.Browser()
Then we must open a session using the url defined above, pointing to the SDSS web page that allows us to make the SQL queries:
resp = br.open(url)
resp.info()
When you want to interact with a web page, you will be interested to know the HTML forms contained in it. An HTML form is a section of the document between the tags: FORM and /FORM
A HTML form contains a series of special objects called controls such as checkboxes, radio buttons, menus, etc. and labels of these objects. The user interacts with the page by modifying a control, for example by selecting an option, introducing a text in a field, etc. and sending this modified form back to the server.
Each HTML form on the page has a name, although this can in some cases be empty. To get a list of the names of the forms in the page we can write:
for f in br.forms():
print f.name
That is, in this case there is only one form on the page, named "sql"
At the same time, each form has a list of controls that also have a name, which can also be left blank. To list the forms on the page, along with their controls, and each control type, we can do the following:
for f in br.forms():
print f.name
for c in f.controls:
print '\t',c.name, '\t', c.type
We will focus on the "cmd" control which is the text area in which we write our SQL, and the "format" control, which, as you can see on the web page, is used to control the type of output desired: HTML, XML, CSV, etc.. To access these controls you must previously select the form to which they belong:
br.select_form(name="sql")
Then we will modify the control 'cmd' to enter our SQL, and the 'format' control, to select the output in csv format.
br['cmd'] = s # This is the string with the sql query
br['format']=['csv'] # data output format
response = br.submit()
We can get a string with the contents of the answer using the get_data() method:
print response.get_data()
But attention!, The submit() method closes the session, so, to send another SQL query you must first repeat the br.open() and br.select() calls.
Then, and in order to be able to process the data more easily, the most advisable could be to generate a Python Pandas dataframe. We can see that the first line should be discarded, while the second row contains the names of the columns, so we will keep it in the dataframe:
file_like = StringIO(response.get_data())
df =pd.read_csv(file_like, skiprows = 1) # skip the first row
df
From here we could do such things like rename the columns giving them names more to our liking, and calculate a new column as the difference of the columns u and g, which will indicate the color of the star (more on that in another post).
df.columns = ['objID','ra','dec','u','g']
df['u-g'] = df['u']-df['g']
df
Summary¶
Once we have seen the rationale of the use of the mechanize module and the creation of a Pandas dataframe, we can define a function to streamline the process for new SQL queries:
def SDSS_select(sql):
'''input: string with a valid SQL query
output: a Pandas dataframe
'''
br.open(url)
br.select_form(name="sql")
br['cmd'] = sql
br['format']=['csv']
response = br.submit()
file_like = StringIO(response.get_data())
return pd.read_csv(file_like, skiprows=1)
The steps are as follows:
- The following instructions will be executed only the first time:
# URL a SQL Search DR10
url = "http://skyserver.sdss3.org/dr10/en/tools/search/sql.aspx"
- We prepare a string with our SQL (you can test it first on the web page, as there's no exception handling in the above function)
sql = 'SELECT TOP 10 \
objID, ra, dec, modelMag_u,modelMag_g,modelMag_r,modelMag_i,modelMag_z \
FROM Star \
WHERE ra BETWEEN 150 and 152 AND dec BETWEEN 30 and 31 AND clean = 1'
- We make a call to the function, obtaining a Pandas dataframe in return
df = SDSS_select(sql)
And we already have our dataframe ready!
df
Here ends this post, but the access method to the SDSS database we have seen here, will be used systematically in future posts, to analyze with Python, based on real data, various properties of celestial objects.
No hay comentarios:
Publicar un comentario