Advice for starting a new plug-in for Sentinel 5 precursor

Maarten Sneep
Advice for starting a new plug-in for Sentinel 5 precursor

I'm working on a data-plugin for the upcoming Sentinel 5 precursor mission (S5P, L2 data only, launch sometime this year). We'd like to have the CIS toolchain available during the commissioning phase so we can compare with a variety of other instruments and models to understand what we do wrong ;-)

I'm one of the developers of the S5P L2 file format, so from that perspective I think I have the knowledge to write the data plugin. I also have several years of Python experience, so I guess I'll be fine there. I just need to get my head around the structure of CIS, and see how some of the idiosyncrasies of the instrument on Sentinel 5 precursor can be handled (some questions follow below).

A small introduction of S5P.

The Sentinel 5 precursor mission carries a single instrument: TROPOMI. Sentinel 5 precursor is an atmospheric composition mission, observing backscattered solar radiance and solar irradiance, and retrieving trace gas columns, ozone profiles, aerosol properties and cloud information from these spectra. More information on the instrument, sample files and documentation can be found at http://www.tropomi.eu

For the discussion here it is important to know that the file format of S5P (both L1B and L2) is netCDF4, with all the metadata we could think of, as fully CF compliant as we can make it, but using hierarchy (groups) to organise the data. Attributes are used to link to geolocation (and other ancillary data fields), but because the latitude and longitude fields may reside in another group, I fear that some special glue is still needed for these files.
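
To give a concrete shape to that "special glue", here is a small sketch of resolving a slash-separated variable path (as it might appear in a `coordinates` attribute) by walking groups from the root. `_Group` is a minimal stand-in for the netCDF4 Dataset/Group interface; the group and variable names below are purely illustrative, not the real S5P layout.

```python
# Stand-in for the netCDF4 group interface: real netCDF4 Dataset/Group
# objects expose `.variables` and `.groups` mappings in the same way.
class _Group:
    def __init__(self, variables=None, groups=None):
        self.variables = variables or {}
        self.groups = groups or {}

def resolve_path(root, path):
    """Walk down the group hierarchy and return the named variable."""
    *group_names, variable_name = path.strip('/').split('/')
    node = root
    for name in group_names:
        node = node.groups[name]
    return node.variables[variable_name]

# Hypothetical layout with geolocation in a sub-group of the data group.
root = _Group(groups={
    'PRODUCT': _Group(groups={
        'GEOLOCATIONS': _Group(variables={'latitude': 'lat-data'})})})

lat = resolve_path(root, 'PRODUCT/GEOLOCATIONS/latitude')
```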

Time storage is always tricky. We use a two-step approach: all variables have an initial time dimension (of length 1, so this is a dummy dimension). The value stored in the accompanying time array is the reference time for this orbit, which is UTC midnight before the start of the orbit. The value is stored as seconds since 2010-01-01 (as indicated in the units of the variable). The same reference time is also stored in global attributes in several different units (ISO date/time string, Julian day to keep IDL users happy, days since 1950-01-01 as used in many models, seconds since 1970-01-01 to help out C and Python programmers). The actual time of observation is stored in an additional delta_time variable, which gives the offset with respect to the reference time in milliseconds.

This two-step approach solves a few issues (and no doubt creates others): It absorbs leap seconds so level 2 users don't have to bother with those, and it decouples the flight direction from the time dimension. The flight direction is of course closely coupled to the latitude or "Y" dimension (we'll be placed in a polar orbit). The latter allows us to more closely observe the CF order of dimensions: T, (Z), Y, X.
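
In code, decoding an observation time from this two-step scheme might look like the following sketch; the reference value and offsets here are made-up numbers, but the units (seconds since 2010-01-01 for the reference, milliseconds for delta_time) follow the description above.

```python
import numpy as np
from datetime import datetime, timedelta

# Hypothetical values following the scheme described above.
reference_time = 86400000                  # seconds since 2010-01-01, i.e.
                                           # UTC midnight 1000 days later
delta_time_ms = np.array([0, 1080, 2160])  # per-observation offsets in ms
                                           # (one observation every 1080 ms)

epoch = datetime(2010, 1, 1)
orbit_reference = epoch + timedelta(seconds=reference_time)
observation_times = [orbit_reference + timedelta(milliseconds=int(dt))
                     for dt in delta_time_ms]
```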

Observations on CIS:

In the CIS documentation I noticed that some code examples use lines that are too long to fit in the available space. This is somewhat annoying, and does not help in understanding the code. Also having nested vertical scroll bars is not nice, but this may well be a limitation of Sphinx itself. The offline version works better so I'll use that instead. In my local copy I figured out where to increase the maximum line-width, so that is solved as well.

In the MODIS example I noticed a small mistake:
"""regex_list = [r'.*' + product + '.*\.hdf' for product in product_names]""" should be """regex_list = [r'.*' + product + r'.*\.hdf' for product in product_names]""" (the raw-modifier applies only to individual strings, not to all parts of an expression, and the character that needs escaping is in the last part).
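For concreteness, the corrected comprehension in action; the product names here are illustrative, not taken from the CIS documentation:

```python
import re

# Note the r-prefix on *each* literal containing a backslash: the raw
# modifier applies per string literal, not to the whole expression.
product_names = ['MYD06_L2', 'MOD06_L2']   # hypothetical MODIS products
regex_list = [r'.*' + product + r'.*\.hdf' for product in product_names]

# A filename matching the first pattern:
matched = any(re.match(regex, 'MYD06_L2.A2015021.0000.hdf')
              for regex in regex_list)
```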

In the data plugin reference there is also a mistake: """return [r'.*CODE*.nc']""" should be """return [r'.*CODE.*\.nc']""".

Questions:

We retrieve ozone profiles from the observations. From the MODIS example I can't quite get how to organise this: There will be an extra vertical dimension in the profiles, how should that be handled? Initially I will simply skip these products (or at least these fields), as that will make things easier. Another option is to have an additional analysis plugin to create partial integrals over a profile, resulting in a single value per ground pixel. But as I said: this is not a prime priority (but if this requires me to prepare the reading plugin differently I'd like to hear that).

What I also do not quite understand is how granules are dealt with. MODIS has 5 minute granules of constant size. The nominal granule size for S5P comes in two flavours. In near real-time operation the granule size for level 2 is 5 minutes of observations (each observation takes 1080 ms, each observation contains 77 to 450 spectra, depending on the band). For offline processing the granule size is a whole orbit (sunlit side only). This means that the number of observations is not fixed (once per day a solar observation is added, shortening the period for radiance observations). My guess is that the number of spectra per observation (the width of the swath) should be constant, but that for the other dimension the only requirement is that they match the latitude & longitude of that granule, not of other granules.

The next issue that I need to resolve is an instrument feature: the 4 detectors - and therefore the different products derived from different parts of the spectrum - will have a different geospatial sampling. When loading a single product this isn't an issue, but when loading products derived from two different detectors (say tropospheric NO₂ and cloud properties) this will have to be handled. Of course when deriving the tropospheric NO₂ column we already take this mismatch into account when we use the cloud product, but we'll want to perform comparisons involving multiple products. We have extra knowledge on how to combine different bands, as the mapping is fixed, at least between four of the 8 bands. We do need an extra lookup table for this. How would my plugin get access to such an extra lookup table through the cis command line tool? Independent mapping to a fixed L3 grid is also possible, but especially cloud products from different orbits should not be aggregated onto a L3 grid due to (natural) variability.

A final question, related to the previous: S5P will fly in an afternoon orbit, relatively close to the A-train, but actually in loose formation with NPP. For offline processing the cloud product from the VIIRS instrument will be mapped onto the S5P observational grid for cloud screening. It would be nice to be able to map L2 from OMI/Aura and MODIS/Aqua to S5P L2 for comparison without going through L3. Can CIS be of use here? Of course an additional plugin for OMI must be made, but the limitations are very similar to those in TROPOMI.

Sorry for the long post, and thanks for reading up to here.

Maarten Sneep (KNMI)

duncanwp
Re: Advice for starting a new plug-in for Sentinel 5 precursor

Hi Maarten,

First of all thanks for using CIS and getting in touch!

I'm working on a data-plugin for the upcoming Sentinel 5 precursor mission (S5P, L2 data only, launch sometime this year). We'd like to have the CIS toolchain available during the commissioning phase so we can compare with a variety of other instruments and models to understand what we do wrong ;-)

Fantastic! I've been thinking about how we could get some of the upcoming Sentinel datasets on board so I'm happy to help. Thanks for the background as well, it's very helpful.

In the CIS documentation I noticed that some code examples use lines that are too long to fit in the available space. This is somewhat annoying, and does not help in understanding the code

Thanks for the feedback, I'll see what I can do about reformatting the examples to help readability. As for the scroll bars, you might want to try our 'readthedocs' page: http://cis.readthedocs.org. I find it much easier to navigate.

On to the questions:

We retrieve ozone profiles from the observations. From the MODIS example I can't quite get how to organise this: There will be an extra vertical dimension in the profiles, how should that be handled?

Vertical components in themselves aren't difficult to include, they are treated as just another coordinate, as we do in some of the other products, see e.g. https://github.com/cedadev/cis/blob/master/cis/data_io/products/caliop.py. Note that in many cases like this we have to artificially expand the dataset along one or more dimensions so that the data is treated as completely unstructured (even though in this case there is obviously some structure). We're not entirely satisfied with this, but it does allow us to treat a wide variety of different data sets on an equal footing. One way to avoid explicitly copying large amounts of data and increasing the memory overhead is to use this numpy strides trick (http://stackoverflow.com/questions/5564098/repeat-numpy-array-without-re...) but I haven't had a chance to implement it in e.g. caliop reading yet. In the medium term I'll be looking to create a 'semi-structured' data type, between purely un-gridded, and purely gridded, which should help a lot for these cases. I think this probably answers your question about granules too, because the variable granule size won't matter.
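
For reference, the no-copy expansion can be done with NumPy's zero-stride broadcasting; np.broadcast_to wraps the same strides trick as raw as_strided, but more safely. The latitude values and level count below are made up.

```python
import numpy as np

lat = np.array([10.0, 20.0, 30.0])   # hypothetical per-scanline latitudes
n_levels = 4                         # hypothetical number of profile levels

# np.repeat would copy the data; broadcasting creates a zero-stride view.
lat_expanded = np.broadcast_to(lat[:, np.newaxis], (lat.size, n_levels))
```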

The next issue that I need to resolve is an instrument feature: the 4 detectors - and therefore the different products derived from different parts of the spectrum - will have a different geospatial sampling. When loading a single product this isn't an issue, but when loading products derived from two different detectors (say tropospheric NO₂ and cloud properties) this will have to be handled

This is a little trickier, and not something we've really considered so far. There are two options that I can think of off the top of my head though. The first is to create two distinct CIS products for each detector product, one which reads the data on its 'native' geospatial sampling, and another which reads the product on another detector's sampling (and uses a lookup table to do the resampling). The second option is to create a separate collocation plugin which allows users to specify a resampling lookup table. This second option is something we've considered anyway for a slightly different use case of repeated collocations of datasets which are on the same sampling - but isn't a current priority. The advantage of the first option in your case is that you would be able to use all of the CIS functionality (plotting, subsetting, aggregation etc) transparently on either sampling.

It would be nice to be able to map L2 from OMI/Aura and MODIS/Aqua to S5p L2 for comparison without going through L3. Can CIS be of use here?

Yes! This is exactly what we designed CIS to be able to do, and we feel it's what sets us apart from some of the other toolboxes, so it's great to hear that it's not just us that this would be useful for :-) Because we treat this kind of collocation in a general way it won't be quite as fast as if you wrote a dedicated algorithm taking into account the similar orbit, but it means you could compare the S5P data with the MODIS collection 5 data we can already read as soon as you have your plugin. (And if performance were really an issue, again we could consider a special collocation plugin which might help).

For me one of the nice things about CIS is users being able to write their own plugins for datasets which they know and use, and being able to share them with other people, so I'm excited to hear it's starting to happen. Do let me know how you get on, and if you have any other questions or comments just let me know!

Thanks,

Duncan

P.S. I'm not sure if you saw, but we're actually running a workshop in Oxford on January 15th which will be covering plugin development, so if you're able to come it should be useful.

Maarten Sneep
Thanks for the reply. First

Thanks for the reply. First the practical side: I won't be able to arrange a trip on such short notice, maybe some later workshop.

For your information: we are actively involved in the L2 file format discussion for Sentinel 4 and Sentinel 5. These are likely to follow the same or a very similar structure. I will purposefully restrict my plug-in to S5P, as the S4 and S5 file format discussions are in an early stage and we do not know the final outcome at this moment. The S4 and S5 plug-ins may be as simple as a subclass with just a different get_file_signature() method.

In the meantime I've started to look at the netCDF module you provide. It makes a few assumptions that are not completely valid for S5P files. The dimensions used by these L2 files are not in the root level of the file, so the get_netcdf_file_variables() function will not filter out dimensions. Besides, even though we use hierarchy for organisation, it may be more convenient to treat the file as a flat structure, and handle the hierarchy internally. In our conventions we recommend against using variables with the same name in different groups, so a variable name is unique. As a side note: all our group and variable names must be legal Python variable names (no spaces or periods allowed in group and variable names; yes, we've learned from the limitations in OMI files).
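
A minimal sketch of that flattening, relying on the S5P convention that variable names are unique across groups; `_Group` is a stand-in mimicking the netCDF4 Dataset/Group interface (`.variables` and `.groups`), and the layout below is hypothetical.

```python
# Stand-in for netCDF4 Dataset/Group: real objects expose the same mappings.
class _Group:
    def __init__(self, variables=None, groups=None):
        self.variables = variables or {}
        self.groups = groups or {}

def flatten_variables(group, table=None):
    """Collect all variables from `group` and its subgroups by bare name."""
    if table is None:
        table = {}
    for name, var in group.variables.items():
        table[name] = var            # unique names assumed: no path prefix
    for subgroup in group.groups.values():
        flatten_variables(subgroup, table)
    return table

# Hypothetical S5P-like layout with data and geolocation below the root.
root = _Group(groups={
    'PRODUCT': _Group(
        variables={'nitrogendioxide_tropospheric_column': 'no2-data'},
        groups={'SUPPORT_DATA': _Group(variables={'latitude': 'lat-data'})}),
})
flat = flatten_variables(root)
```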

The current file format is a combination of the experiences from KNMI and DLR in dealing with OMI and GOME-2, and of course the earlier missions as well. In addition some of the ideas were adopted from the S5P L1B format, in particular the time storage and the storage of granule metadata: all metadata is stored as attributes on a hierarchy of groups, so there is no need for a separate metadata side-file. This also means that all metadata can be accessed using a single API (netCDF4). This is a significant improvement over the OMI files, where some rather essential metadata (e.g. the orbit number) had to be extracted from an ODL-formatted string somewhere in the file. Because both data and metadata are stored in a single file, the dimensions are moved into a sub-group of the root level, to more cleanly separate data from metadata.

For now I'll skip the profiles by filtering out fields that have too many dimensions, we can deal with this later.

As for the detector mapping: How does that extra lookup table reach my code? Is this something that can be specified on the cis command-line and reach my code? Or should I configure this differently?

Best,

Maarten

duncanwp
Re: Advice for starting a new plug-in for Sentinel 5 precursor

Besides: even though we use hierarchy for organisation, it may be more convenient to treat the file as a flat structure, and handle the hierarchy internally.

Sure, we actually do something very similar, by just flattening hierarchical variable names. But feel free to use whichever method you find most appropriate.

As for the detector mapping: How does that extra lookup table reach my code? Is this something that can be specified on the cis command-line and reach my code? Or should I configure this differently?

We have a similar issue with orographic data associated with model files and we deal with that by just asking the user to include the orography file in their list of data files. So if your plugin just expects one of the files the user passes in to include a file containing the mapping (by giving it a specific filename or NetCDF variable name) then you can use it directly in your plugin.
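
So the plugin could, for instance, pick the mapping file out of the user's file list by filename; the naming convention and filenames below are entirely made up for illustration.

```python
import re

def partition_filenames(filenames, lookup_pattern=r'.*band_mapping.*\.nc'):
    """Separate the band-mapping lookup table from the ordinary data files.

    The lookup file is identified by a (hypothetical) filename convention.
    """
    lookup = [f for f in filenames if re.match(lookup_pattern, f)]
    data = [f for f in filenames if f not in lookup]
    return data, lookup

data_files, lookup_files = partition_filenames([
    'S5P_OFFL_L2__NO2____20150101.nc',   # made-up data filename
    's5p_band_mapping_v1.nc',            # made-up lookup-table filename
])
```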

Hope that helps, as I say if you have any other questions just give me a shout.
