Pandas

Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools. Pandas is particularly suited to the analysis of tabular data, i.e. data that can go into a table. In other words, if you can imagine the data in an Excel spreadsheet, then Pandas is the tool for the job.

A recent analysis of questions from Stack Overflow showed that Python is the fastest-growing and most widely used programming language in developed countries.

[Figure: growth of Python question traffic on Stack Overflow]

A follow-up analysis showed that this growth is driven by data science packages such as NumPy, Matplotlib, and especially Pandas.

[Figure: growth of pandas question traffic on Stack Overflow]

The exponential growth of Pandas is due to the fact that it just works. It saves you time and helps you do science more efficiently and effectively.

Pandas capabilities (from the Pandas website):

  • A fast and efficient DataFrame object for data manipulation with integrated indexing;
  • Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
  • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
  • Flexible reshaping and pivoting of data sets;
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
  • Columns can be inserted and deleted from data structures for size mutability;
  • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
  • High performance merging and joining of data sets;
  • Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
  • Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
  • Highly optimized for performance, with critical code paths written in Cython or C.
  • Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

In this lecture, we will go over the basic capabilities of Pandas. It is a very deep library, and you will need to dig into the documentation for more advanced usage.

Pandas was created by Wes McKinney. Many of the examples here are drawn from his book Python for Data Analysis, which includes a GitHub repo of code samples.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

Pandas Data Structures: Series

A Series represents a one-dimensional array of data. The main difference between a Series and a NumPy array is that a Series has an index. The index contains the labels that we use to access the data.

There are many ways to create a Series. We will just show a few.

In [83]:
names = ['Ryan', 'Chiara', 'Johnny']
values = [35, 36, 1.8]
ages = pd.Series(values, index=names)
ages
Out[83]:
Ryan      35.0
Chiara    36.0
Johnny     1.8
dtype: float64
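
As one more example (a minimal sketch), a Series can also be built directly from a dictionary, in which case the keys become the index:

# equivalent construction from a dict: keys become the index
ages_from_dict = pd.Series({'Ryan': 35, 'Chiara': 36, 'Johnny': 1.8})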

Series have built-in plotting methods.

In [84]:
ages.plot(kind='bar')
Out[84]:

Arithmetic operations and most NumPy functions can be applied to Series. An important point is that a Series keeps its index during such operations.

In [90]:
np.log(ages) / ages**2
Out[90]:
Ryan      0.002902
Chiara    0.002765
Johnny    0.181416
dtype: float64
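
This index-keeping goes further: when two Series are combined, pandas aligns them by label rather than by position, and labels present in only one of them produce NaN. A minimal sketch (the heights Series here is made up for illustration):

heights = pd.Series([180, 155, 83, 170],
                    index=['Ryan', 'Chiara', 'Johnny', 'Takaya'])
# 'Takaya' has no entry in ages, so the aligned result is NaN there
heights / ages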

We can access the underlying index object if we need to:

In [93]:
ages.index
Out[93]:
Index(['Ryan', 'Chiara', 'Johnny'], dtype='object')

We can get values back out using the index via the .loc attribute:

In [95]:
ages.loc['Johnny']
Out[95]:
1.8

Or by raw position using .iloc:

In [96]:
ages.iloc[2]
Out[96]:
1.8

If we need to, we can always get the raw data back out as well:

In [97]:
ages.values
Out[97]:
array([ 35. ,  36. ,   1.8])
In [98]:
ages.index
Out[98]:
Index(['Ryan', 'Chiara', 'Johnny'], dtype='object')

Pandas Data Structures: DataFrame

There is a lot more to Series, but they are limited to a single "column". A more useful Pandas data structure is the DataFrame. A DataFrame is basically a bunch of Series that share the same index. It's a lot like a table in a spreadsheet.

Below we create a DataFrame.

In [122]:
# first we create a dictionary
data = {'age': [35, 36, 1.8],
        'height': [180, 155, 83],
        'weight': [72.5, np.nan, 11.3]}
df = pd.DataFrame(data, index=['Ryan', 'Chiara', 'Johnny'])
df
Out[122]:
         age  height  weight
Ryan    35.0     180    72.5
Chiara  36.0     155     NaN
Johnny   1.8      83    11.3

Pandas handles missing data very elegantly, keeping track of it through all calculations.

In [101]:
df.info()

Index: 3 entries, Ryan to Johnny
Data columns (total 3 columns):
age       3 non-null float64
height    3 non-null int64
weight    2 non-null float64
dtypes: float64(2), int64(1)
memory usage: 96.0+ bytes
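
For instance, reductions skip NaN by default, and dropna / fillna give explicit control over missing values. A minimal sketch:

df.weight.mean()       # NaN is skipped: the mean of 72.5 and 11.3
df.weight.dropna()     # drop the missing entry entirely
df.weight.fillna(0.0)  # or replace it with a fill value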

A wide range of statistical functions are available on both Series and DataFrames.

In [113]:
df.min()
Out[113]:
age        1.8
height    83.0
weight    11.3
dtype: float64
In [114]:
df.mean()
Out[114]:
age        24.266667
height    139.333333
weight     41.900000
dtype: float64
In [115]:
df.std()
Out[115]:
age       19.463128
height    50.362023
weight    43.274935
dtype: float64
In [102]:
df.describe()
Out[102]:
             age      height     weight
count   3.000000    3.000000   2.000000
mean   24.266667  139.333333  41.900000
std    19.463128   50.362023  43.274935
min     1.800000   83.000000  11.300000
25%    18.400000  119.000000  26.600000
50%    35.000000  155.000000  41.900000
75%    35.500000  167.500000  57.200000
max    36.000000  180.000000  72.500000

We can get a single column as a Series using Python's getitem syntax on the DataFrame object.

In [112]:
df['height']
Out[112]:
Ryan      180
Chiara    155
Johnny     83
Name: height, dtype: int64

...or using attribute syntax.

In [116]:
df.height
Out[116]:
Ryan      180
Chiara    155
Johnny     83
Name: height, dtype: int64

New columns can easily be added to DataFrames.

In [123]:
df['density'] = df.weight / df.height
df
Out[123]:
         age  height  weight   density
Ryan    35.0     180    72.5  0.402778
Chiara  36.0     155     NaN       NaN
Johnny   1.8      83    11.3  0.136145

Merging Data

Pandas supports a wide range of methods for merging different datasets. These are described extensively in the documentation. Here we just give a few examples.

In [125]:
education = pd.Series(['PhD', 'PhD', None, 'masters'],
                     index=['Ryan', 'Chiara', 'Johnny', 'Takaya'],
                     name='education')
# returns a new DataFrame
df.join(education)
Out[125]:
         age  height  weight   density education
Ryan    35.0     180    72.5  0.402778       PhD
Chiara  36.0     155     NaN       NaN       PhD
Johnny   1.8      83    11.3  0.136145      None
In [127]:
# returns a new DataFrame
df.join(education, how='right')
Out[127]:
         age  height  weight   density education
Ryan    35.0   180.0    72.5  0.402778       PhD
Chiara  36.0   155.0     NaN       NaN       PhD
Johnny   1.8    83.0    11.3  0.136145      None
Takaya   NaN     NaN     NaN       NaN   masters
In [128]:
# returns a new DataFrame
df.reindex(['Ryan', 'Chiara', 'Johnny', 'Takaya', 'Kerry'])
Out[128]:
         age  height  weight   density
Ryan    35.0   180.0    72.5  0.402778
Chiara  36.0   155.0     NaN       NaN
Johnny   1.8    83.0    11.3  0.136145
Takaya   NaN     NaN     NaN       NaN
Kerry    NaN     NaN     NaN       NaN
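
Besides join and reindex, pandas also offers pd.merge for database-style joins on columns. A minimal sketch with made-up DataFrames:

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [4, 5, 6]})
# inner join keeps only keys present in both; how='outer' keeps all
pd.merge(left, right, on='key', how='inner')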

We can also index using a boolean Series. This is very useful.

In [131]:
adults = df[df.age > 18]
adults
Out[131]:
         age  height  weight   density
Ryan    35.0     180    72.5  0.402778
Chiara  36.0     155     NaN       NaN
In [133]:
df['is_adult'] = df.age > 18
df
Out[133]:
         age  height  weight   density  is_adult
Ryan    35.0     180    72.5  0.402778      True
Chiara  36.0     155     NaN       NaN      True
Johnny   1.8      83    11.3  0.136145     False

Plotting

DataFrames have all kinds of useful plotting built in.

In [108]:
df.plot(kind='scatter', x='age', y='height', grid=True)
Out[108]:
In [110]:
df.plot(kind='bar')
Out[110]:

Time Indexes

Indexes are very powerful. They are a big part of why Pandas is so useful. There are different indices for different types of data. Time Indexes are especially great!

In [134]:
two_years = pd.date_range(start='2014-01-01', end='2016-01-01', freq='D')
timeseries = pd.Series(np.sin(2 * np.pi * two_years.dayofyear / 365),
                       index=two_years)
timeseries.plot()
Out[134]:

We can use Python's slicing notation inside .loc to select a date range.

In [137]:
timeseries.loc['2015-01-01':'2015-07-01'].plot()
Out[137]:
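
Time indexes also support partial string indexing: a bare year or month selects the whole period. A quick sketch:

timeseries.loc['2015']     # everything in 2015
timeseries.loc['2015-06']  # just June 2015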

Stock Market Data

Now we read some stock market data from Google Finance. I have created direct links to Google and Apple stock price data.

In [169]:
!curl -L -o goog.csv http://tinyurl.com/rces-goog
!curl -L -o aapl.csv http://tinyurl.com/rces-aapl-csv
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  132k    0  132k    0     0   185k      0 --:--:-- --:--:-- --:--:--  185k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  166k    0  166k    0     0   110k      0 --:--:--  0:00:01 --:--:-- 1795k
In [15]:
! head goog.csv
Date,Open,High,Low,Close,Volume
11-Oct-17,973.72,990.71,972.25,989.25,1693274
10-Oct-17,980.00,981.57,966.08,972.60,968362
9-Oct-17,980.00,985.42,976.11,977.00,891355
6-Oct-17,966.70,979.46,963.36,978.89,1173882
5-Oct-17,955.49,970.91,955.18,969.96,1213816
4-Oct-17,957.00,960.39,950.69,951.68,952391
3-Oct-17,954.00,958.00,949.14,957.79,888346
2-Oct-17,959.98,962.54,947.84,953.27,1283444
29-Sep-17,952.00,959.79,951.51,959.11,1580994

We can see that this is well-formatted, tidy CSV data, ready for immediate ingestion into Pandas. We use Pandas' amazing read_csv function to do this.

In [170]:
goog = pd.read_csv('goog.csv')
goog.head()
Out[170]:
        Date    Open    High     Low   Close   Volume
0  11-Oct-17  973.72  990.71  972.25  989.25  1693274
1  10-Oct-17  980.00  981.57  966.08  972.60   968362
2   9-Oct-17  980.00  985.42  976.11  977.00   891355
3   6-Oct-17  966.70  979.46  963.36  978.89  1173882
4   5-Oct-17  955.49  970.91  955.18  969.96  1213816

Not bad! But we can do better by giving read_csv some hints.

In [172]:
goog = pd.read_csv('goog.csv', parse_dates=[0], index_col=0)
goog.head()
Out[172]:
              Open    High     Low   Close   Volume
Date
2017-10-11  973.72  990.71  972.25  989.25  1693274
2017-10-10  980.00  981.57  966.08  972.60   968362
2017-10-09  980.00  985.42  976.11  977.00   891355
2017-10-06  966.70  979.46  963.36  978.89  1173882
2017-10-05  955.49  970.91  955.18  969.96  1213816
In [173]:
goog.info()

DatetimeIndex: 3310 entries, 2017-10-11 to 2004-08-19
Data columns (total 5 columns):
Open      3310 non-null float64
High      3310 non-null float64
Low       3310 non-null float64
Close     3310 non-null float64
Volume    3310 non-null object
dtypes: float64(4), object(1)
memory usage: 155.2+ KB
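
Note that Volume came in as object rather than a numeric dtype, presumably because some entries are not clean integers. A sketch of coercing it (rows that fail to parse become NaN):

goog['Volume'] = pd.to_numeric(goog.Volume, errors='coerce')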
In [174]:
goog.Close.plot()
Out[174]:
In [180]:
aapl = pd.read_csv('aapl.csv', parse_dates=[0], index_col=0)
aapl.info()

DatetimeIndex: 4000 entries, 2017-10-11 to 2001-11-20
Data columns (total 5 columns):
Open      4000 non-null float64
High      4000 non-null float64
Low       4000 non-null float64
Close     4000 non-null float64
Volume    4000 non-null int64
dtypes: float64(4), int64(1)
memory usage: 187.5 KB
In [176]:
aapl_close = aapl.Close.rename('aapl')
goog_close = goog.Close.rename('goog')
stocks = pd.concat([aapl_close, goog_close], axis=1)
stocks.head()
Out[176]:
            aapl  goog
Date
2001-11-20  1.40   NaN
2001-11-21  1.41   NaN
2001-11-23  1.42   NaN
2001-11-26  1.53   NaN
2001-11-27  1.50   NaN
In [177]:
stocks.plot()
Out[177]:

Pandas knows how to take correlations, along with tons of other computations.

In [181]:
stocks.corr()
Out[181]:
          aapl      goog
aapl  1.000000  0.917181
goog  0.917181  1.000000
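
As one example of those other computations (a sketch), pct_change converts prices to day-over-day returns, whose correlation is usually more meaningful than the correlation of raw prices:

returns = stocks.pct_change()  # fractional day-over-day change
returns.corr()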

Because it understands times, it can do really cool stuff like resampling.

In [190]:
# resample by taking the mean over each month

fig, ax = plt.subplots()
stocks.resample('MS').mean().plot(ax=ax, color=['r', 'b'])
# and each year
stocks.resample('AS').mean().plot(ax=ax, color=['r', 'b'])
Out[190]:

The string MS means "month start". The string AS means "year start". There is a long list of possible frequency aliases.

We can also apply other reduction operations with resample. These are described in the resample docs.

In [196]:
# get resample object
rs = stocks.goog.resample('MS')
# standard deviation of each month
rs.std().plot()
Out[196]:
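
A resample object also supports agg, so several reductions can be computed at once. A sketch:

# monthly mean, std, and min/max in one pass
rs.agg(['mean', 'std', 'min', 'max']).head()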

Temperature Data

We download some timeseries data from the Berkeley Earth (http://berkeleyearth.org/) surface temperature dataset. This is timeseries data from various locations around Earth. Let's get our local temperatures.

In [207]:
! curl -o nyc_temp.txt http://berkeleyearth.lbl.gov/auto/Local/TAVG/Text/40.99N-74.56W-TAVG-Trend.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  313k  100  313k    0     0   150k      0  0:00:02  0:00:02 --:--:--  150k

If we examine this data, we see it is NOT a well-formatted CSV file. Loading it will be a bit painful, but Pandas makes the job relatively easy.

In [238]:
! head -72 nyc_temp.txt | tail -8
% 
%                  Monthly          Annual          Five-year        Ten-year        Twenty-year
% Year, Month,  Anomaly, Unc.,   Anomaly, Unc.,   Anomaly, Unc.,   Anomaly, Unc.,   Anomaly, Unc.
 
  1743    11    -2.276  1.665       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN
  1743    12       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN
  1744     1       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN
  1744     2       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN
In [4]:
# data source: http://berkeleyearth.lbl.gov/locations/40.99N-74.56W
# file: http://berkeleyearth.lbl.gov/auto/Local/TAVG/Text/40.99N-74.56W-TAVG-Trend.txt

# keep only the first three columns: year, month, monthly anomaly
col_names = ['year', 'month', 'monthly_anom']
temp = pd.read_csv('nyc_temp.txt',
                   header=None, usecols=[0, 1, 2], names=col_names,
                   delim_whitespace=True, comment='%')

temp.head()
Out[4]:
   year  month  monthly_anom
0  1743     11        -2.276
1  1743     12           NaN
2  1744      1           NaN
3  1744      2           NaN
4  1744      3           NaN
In [5]:
# to_datetime needs a day as well as year and month
date_df = temp.drop('monthly_anom', axis=1)
date_df['day'] = 1
date_index = pd.DatetimeIndex(pd.to_datetime(date_df))
temp = temp.set_index(date_index).drop(['year', 'month'], axis=1)
temp.head()
Out[5]:
            monthly_anom
1743-11-01        -2.276
1743-12-01           NaN
1744-01-01           NaN
1744-02-01           NaN
1744-03-01           NaN
In [6]:
temp.plot()
Out[6]:
In [7]:
fig, ax = plt.subplots()
temp.plot(ax=ax)
temp.resample('AS').mean().plot(ax=ax)
temp.resample('10AS').mean().plot(ax=ax)
Out[7]:

Pandas can do both time-based resampling and operations over fixed-length rolling windows. These are very similar but distinct; see the discussion in the Pandas docs.

In [14]:
# more advanced operation on rolling windows
def difference_max_min(data):
    return data.max() - data.min()

rw = temp.rolling('365D')
rw.apply(difference_max_min).plot()
Out[14]:
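
To make the resample / rolling distinction concrete, here is a sketch plotting an annual-mean resample against a 365-day rolling mean; the former yields one point per year bin, the latter a smoothed value at every monthly observation:

fig, ax = plt.subplots()
temp.resample('AS').mean().plot(ax=ax)   # one value per year bin
temp.rolling('365D').mean().plot(ax=ax)  # a value at every point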

To create a "climatology" (i.e. the average over all years for each calendar month), we can use Pandas' groupby functionality.

In [18]:
# the seasonal cycle has already been removed from these anomaly data!
temp.groupby(temp.index.month).mean().plot()
Out[18]:
In [26]:
# find the hottest years
temp.groupby(temp.index.year).mean().sort_values('monthly_anom', ascending=False).head(10)
Out[26]:
      monthly_anom
2012      2.009833
1998      1.862250
2006      1.557583
1949      1.424500
2010      1.395917
1991      1.395583
1990      1.360833
2011      1.310583
2002      1.290500
1953      1.212833

Groupby

Now we will explore groupby's capabilities further using a public dataset from the City of New York: the Rat Information Portal!

In [28]:
# https://data.cityofnewyork.us/Health/Rats/amyk-xiv9
rats = pd.read_csv('https://data.cityofnewyork.us/api/views/amyk-xiv9/rows.csv',
                  parse_dates=['APPROVED_DATE', 'INSPECTION_DATE'])
In [29]:
rats.info()

RangeIndex: 151315 entries, 0 to 151314
Data columns (total 20 columns):
INSPECTION_TYPE                151315 non-null object
JOB_TICKET_OR_WORK_ORDER_ID    151315 non-null int64
JOB_ID                         151315 non-null object
JOB_PROGRESS                   151315 non-null int64
BBL                            151315 non-null int64
BORO_CODE                      151315 non-null int64
BLOCK                          151315 non-null int64
LOT                            151315 non-null int64
HOUSE_NUMBER                   149259 non-null object
STREET_NAME                    151275 non-null object
ZIP_CODE                       151309 non-null float64
X_COORD                        150341 non-null float64
Y_COORD                        150341 non-null float64
LATITUDE                       150930 non-null float64
LONGITUDE                      150930 non-null float64
BOROUGH                        151315 non-null object
INSPECTION_DATE                151315 non-null datetime64[ns]
RESULT                         151315 non-null object
APPROVED_DATE                  151315 non-null datetime64[ns]
LOCATION                       150915 non-null object
dtypes: datetime64[ns](2), float64(5), int64(6), object(7)
memory usage: 23.1+ MB
In [33]:
rats.head()
Out[33]:
INSPECTION_TYPE JOB_TICKET_OR_WORK_ORDER_ID JOB_ID JOB_PROGRESS BBL BORO_CODE BLOCK LOT HOUSE_NUMBER STREET_NAME ZIP_CODE X_COORD Y_COORD LATITUDE LONGITUDE BOROUGH INSPECTION_DATE RESULT APPROVED_DATE LOCATION
0 COMPLIANCE 20090 PO14874 2 3001480046 3 148 46 81 WILLOUGHBY STREET 11201.0 988201.0 191458.0 40.692184 -73.985753 Brooklyn 2010-05-14 13:46:39 Active Rat Signs 2010-05-18 13:00:01 (40.6921838341671, -73.9857525784075)
1 COMPLIANCE 24538 PO17193 2 4123940091 4 12394 91 172-04 LINDEN BOULEVARD 11434.0 1045829.0 191395.0 40.691804 -73.777942 Queens 2010-03-03 09:21:09 Active Rat Signs 2010-03-05 11:30:16 (40.6918037759981, -73.7779418602477)
2 COMPLIANCE 25025 PO15009 2 2029470041 2 2947 41 1881 BELMONT AVENUE 10457.0 1013847.0 247021.0 40.844647 -73.893030 Bronx 2010-01-06 10:45:24 Active Rat Signs 2010-01-14 11:45:09 (40.8446468740184, -73.89302971662)
3 COMPLIANCE 25027 PO15005 2 2032340065 2 3234 65 212 WEST FORDHAM ROAD 10468.0 1009198.0 253356.0 40.862049 -73.909810 Bronx 2010-01-06 15:15:55 Active Rat Signs 2010-01-14 11:44:46 (40.8620488653607, -73.9098097882175)
4 COMPLIANCE 25032 PO15007 2 2043520001 2 4352 1 789 ASTOR AVENUE 10467.0 1021812.0 252577.0 40.859866 -73.864209 Bronx 2010-01-22 15:10:03 Active Rat Signs 2010-01-26 15:16:54 (40.8598660947898, -73.8642093926991)

Let's do some grouping to explore the data.

In [31]:
rats.groupby('INSPECTION_TYPE')['INSPECTION_TYPE'].count()
Out[31]:
INSPECTION_TYPE
COMPLIANCE     51277
INITIAL       100038
Name: INSPECTION_TYPE, dtype: int64
In [32]:
rats.groupby('BORO_CODE')['BORO_CODE'].count().head()
Out[32]:
BORO_CODE
1    46206
2    56984
3    37112
4     9058
5     1955
Name: BORO_CODE, dtype: int64
In [34]:
rats.groupby('STREET_NAME')['STREET_NAME'].count().head(20)
Out[34]:
STREET_NAME
                                     1
   1 AVENUE                         24
   2 AVENUE                         28
   3 AVENUE                         19
   3 STREET                          3
   4 AVENUE                         10
   5 AVENUE                         46
   5 STREET                          3
   6 AVENUE                          9
   6 STREET                          3
   7 AVENUE                         30
   7 AVENUE SOUTH                    1
   8 AVENUE                         13
   9 AVENUE                          8
   9 STREET                          9
  10 AVENUE                          2
  11 AVENUE                          1
  11 STREET                          2
  12 AVENUE                          2
  12 STREET                          4
Name: STREET_NAME, dtype: int64

This dataset clearly needs some cleaning. We can use Pandas' text features to strip the whitespace out of the data.

In [39]:
# clean up street name
street_names_cleaned = rats.STREET_NAME.str.strip()
street_names_cleaned.groupby(street_names_cleaned).count().head(20)
Out[39]:
STREET_NAME
                          1
********************     53
0000                      1
1 AVENUE                570
1 PLACE                   2
1 STREET                 22
1/2 2 AVENUE              1
1/2 5 AVENUE              2
1/2 7 AVENUE SOUTH        4
1/2 BEDFORD STREET        1
1/2 BRIGGS AVENUE         1
1/2 DOWNING STREET        1
1/2 EAST 115 STREET       4
1/2 EAST 7 STREET         3
1/2 EAST 9 STREET         1
1/2 KING STREET           1
1/2 ROGERS AVENUE         3
10 AVENUE               226
10 STREET                40
100 STREET                3
Name: STREET_NAME, dtype: int64
In [40]:
count = street_names_cleaned.groupby(street_names_cleaned).count()
count.sort_values(ascending=False).head(20)
Out[40]:
STREET_NAME
GRAND CONCOURSE       2299
BROADWAY              2226
AMSTERDAM AVENUE      1213
3 AVENUE              1188
WEBSTER AVENUE        1138
MORRIS AVENUE         1077
JEROME AVENUE         1030
EASTERN PARKWAY        981
WALTON AVENUE          975
ST NICHOLAS AVENUE     964
PARK AVENUE            961
LAFAYETTE AVENUE       893
VALENTINE AVENUE       826
2 AVENUE               807
LEXINGTON AVENUE       805
BEDFORD AVENUE         791
WASHINGTON AVENUE      768
GRAND AVENUE           748
GREENE AVENUE          745
CRESTON AVENUE         745
Name: STREET_NAME, dtype: int64
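
This count-then-sort pattern is common enough that pandas provides value_counts as a one-step shortcut. A sketch:

# equivalent to groupby-count followed by sort_values(ascending=False)
street_names_cleaned.value_counts().head(20)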

To get a better idea of the geography, let's plot the locations of the inspections. But first let's look at the statistics.

In [42]:
rats[['LATITUDE', 'LONGITUDE']].describe()
Out[42]:
            LATITUDE      LONGITUDE
count  150930.000000  150930.000000
mean       40.739645     -73.890969
std         1.609669       1.991634
min       -85.443283    -155.778285
25%        40.704909     -73.956307
50%        40.798729     -73.927440
75%        40.844159     -73.901627
max        40.912863      56.311386

There are clearly some weird outliers in the location data. We need to strip these out before plotting.

In [24]:
valid_latlon = rats[(rats.LATITUDE > 30) & (rats.LONGITUDE < -70)]
valid_latlon.plot.hexbin('LONGITUDE', 'LATITUDE', C='BORO_CODE', cmap='Set1')
Out[24]:
In [34]:
# https://github.com/pandas-dev/pandas/issues/10678
valid_latlon.plot.hexbin('LONGITUDE', 'LATITUDE', sharex=False)
Out[34]:
In [36]:
valid_latlon.plot.hexbin('LONGITUDE', 'LATITUDE', sharex=False, bins='log', cmap='magma')
Out[36]:
In [70]:
manhattan_rats = valid_latlon[valid_latlon.BORO_CODE==1]
manhattan_rats.plot.hexbin('LONGITUDE', 'LATITUDE', sharex=False, bins='log', cmap='magma')
Out[70]:
In [52]:
inspection_date = pd.DatetimeIndex(rats.INSPECTION_DATE)

fig, ax = plt.subplots()
rats.groupby(inspection_date.weekday)['JOB_ID'].count().plot(kind='bar', ax=ax)
ax.set_xlabel('weekday');
In [51]:
fig, ax = plt.subplots()
rats.groupby(inspection_date.hour)['JOB_ID'].count().plot(kind='bar', ax=ax)
ax.set_xlabel('hour');
In [53]:
fig, ax = plt.subplots()
rats.groupby(inspection_date.month)['JOB_ID'].count().plot(kind='bar', ax=ax)
ax.set_xlabel('month')
Out[53]: