# SIproc

### Description

SIproc is a command-line tool focused on out-of-core data processing and classification of hyperspectral images. This software is optimized around algorithms that can be applied to data streamed from secondary storage (ex. hard drives and NAS). This allows SIproc to perform image processing and segmentation on data sets that are several terabytes in size.

SIproc consists of multiple tools that perform specific functions in the image segmentation chain:

• SIproc - Image preprocessing, noise reduction, basic analysis, and dimension reduction
• SIview - Hyperspectral image visualization and browsing
• SItrain - Classifier training for k-means clustering and random forests
• SIpredict - Classifier prediction for k-means clustering and random forests
• SIstain - Digital staining of hyperspectral images based on ground truth acquired using a different imaging modality. This is based on the work described in Mayerich, et al., Technology, 3(1) 2015

### Usage

All applications are command-line driven, and designed to be combined using a scripting language such as Python. A list of valid arguments, as well as usage examples, can be retrieved using the --help argument:

>> siproc --help

The general usage for all applications is:

si**** [input file] [output file] --algorithm [param1 param2 ...] --mask maskfile.bmp

Where [input file] is the file containing the hyperspectral data, [output file] is the file produced by the desired algorithm, --algorithm is the algorithm used for data processing (ex. baseline correction, normalization, PCA), and --mask lets the user provide a binary (black/white) image specifying the pixels to which the algorithm will be applied.
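Since the tools are meant to be driven from a scripting language, the general usage pattern above can be wrapped in a small Python helper. This is a minimal sketch: `build_command` is a hypothetical name, not part of SIproc, and it only assembles the argument list in the order shown above.

```python
def build_command(tool, input_file, output_file, algorithm, params=(), mask=None):
    """Assemble an argument list following the general usage pattern:
    tool [input file] [output file] --algorithm [params...] --mask maskfile
    """
    argv = [tool, input_file, output_file, "--" + algorithm]
    argv += [str(p) for p in params]   # algorithm parameters, in order
    if mask is not None:               # the mask is optional
        argv += ["--mask", mask]
    return argv


cmd = build_command("siproc", "original_file", "normalized_file",
                    "normalize", [1650], mask="mask.bmp")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the SIproc executables are on the path.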

For example, assume that we wish to normalize a hyperspectral image by taking a ratio to the Amide I protein band (1650 cm⁻¹). Since this result would be undefined for pixels that don't contain tissue, we can create a mask using the unprocessed data:

siproc original_file mask.bmp --build-mask 1650 0.1

This will generate a binary image mask.bmp where:

• all pixels at the 1650 cm⁻¹ band with values above the 0.1 threshold are white
• all pixels at the 1650 cm⁻¹ band with values below the 0.1 threshold are black

We can then normalize only these pixels to create another hyperspectral image:

siproc original_file normalized_file --normalize 1650 --mask mask.bmp

The final output file normalized_file will have normalized spectral values at masked pixels, while unmasked pixels will be set to 0 (zero). Limiting an algorithm by using a mask will generally save a significant amount of computational time, especially when the file is in a BIP format.
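The effect of the two commands above can be illustrated on a tiny in-memory image. This is a conceptual sketch only, not SIproc's implementation: the function names are hypothetical, each pixel is a plain list of band values, and treating the threshold as inclusive is an assumption.

```python
def build_mask(pixels, band, threshold):
    """Return True (white) for pixels whose value at `band` reaches `threshold`.
    Whether SIproc treats the threshold as inclusive is an assumption here."""
    return [spectrum[band] >= threshold for spectrum in pixels]


def normalize(pixels, band, mask):
    """Divide each masked spectrum by its value at `band`;
    unmasked pixels are set to zero, as described above."""
    result = []
    for spectrum, keep in zip(pixels, mask):
        if keep:
            reference = spectrum[band]
            result.append([value / reference for value in spectrum])
        else:
            result.append([0.0] * len(spectrum))
    return result


# two pixels with three bands each; band index 1 plays the role of Amide I
pixels = [[0.2, 0.4, 0.1],     # tissue
          [0.01, 0.02, 0.0]]   # empty background
mask = build_mask(pixels, band=1, threshold=0.1)
normalized = normalize(pixels, band=1, mask=mask)
```

The background pixel never takes part in the division, which is why masking avoids undefined ratios.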

### Compiling

This code has been tested on Ubuntu and Windows. All of the necessary libraries are available on Ubuntu using aptitude. Windows requires manual installation of all libraries. The only source repository required is the STIM-Lib repository, which can be directly downloaded or pulled via Git:

#### STIM-Lib

The following libraries are required to build SIproc:

#### Boost

• Only required on Linux (install using aptitude)

#### GLEW

• Windows: http://glew.sourceforge.net/
• Windows: set the environment variable GLEW_ROOT to the directory containing the lib and GL directories
• Ubuntu: aptitude will install this correctly

#### LAPACKE/LAPACK/BLAS

• Windows: you really have to compile this yourself (sorry, I recommend the MinGW 64-bit option):
• http://www.netlib.org/lapack/
• set the environment variables LAPACKE_PATH and LAPACK_PATH to the directory containing the lapacke header files and all necessary lib files
• Ubuntu: aptitude will install this correctly

### Tutorial

In this section, we will provide an example of hyperspectral image processing from data collection to tissue classification.

#### Data Acquisition

We first acquire a data set using a mid-infrared imaging system, in this case an Agilent Cary 620 FTIR Microscope. These instruments use Resolutions Pro to manage image acquisition. After acquiring the image, you can use the Resolutions Pro software to create an ENVI header file describing the binary format of the images by using the "Export to ENVI" option.

The individual images collected using the Cary system can then be reconstructed into a complete mosaic using:

siproc ./directory/containing/image mosaic_image --mosaic 128

Where the parameter of --mosaic is the size of the focal plane array in the Cary 620 system. This will combine all of the FPA fields produced by the Cary system into a single mosaic_image file and corresponding mosaic_image.hdr header file. Note that this process will vary depending on the imaging instrumentation used. Our software is designed to be used with binary images with ENVI header files. This header format is openly available, if you would like to generate your own.
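If you want to inspect or generate ENVI headers yourself, the format is plain text with `key = value` fields. The sketch below is a minimal reader written for illustration, not SIproc code: `parse_envi_header` is a hypothetical name, and the sample header values are made up.

```python
def parse_envi_header(text):
    """Parse the 'key = value' fields of an ENVI header into a dict of strings.
    Values enclosed in braces may span several lines and are joined."""
    fields = {}
    lines = iter(text.splitlines())
    for line in lines:
        if "=" not in line:
            continue                      # skips the leading "ENVI" line and blanks
        key, value = line.split("=", 1)
        value = value.strip()
        while value.startswith("{") and not value.endswith("}"):
            # continuation of a brace-delimited value; stop at end of input
            value += " " + next(lines, "}").strip()
        fields[key.strip().lower()] = value
    return fields


sample = """ENVI
samples = 128
lines = 128
bands = 3
header offset = 0
data type = 4
interleave = bip
byte order = 0
wavelength = {
 1000.0, 1650.0,
 3300.0}
"""
header = parse_envi_header(sample)
wavelengths = [float(v) for v in header["wavelength"].strip("{}").split(",")]
```

The `interleave` field (bsq, bil, or bip) describes how the binary data is ordered on disk, which becomes relevant in the conversion step of the tutorial.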

Since you may not have access to an FTIR imaging system or viable data, this tutorial can be completed by using a breast biopsy data set that we have uploaded for this purpose. This data set is an image of four breast biopsy cores from the BRC961 tissue microarray (TMA) acquired from Biomax Inc. This data set is approximately 2.2GB in size and can be downloaded using BitTorrent Sync with the sync key: BKQBA3B5MBL2F4GFADUJYVLYYSUSZ7NOR

#### Data Preprocessing

We will now describe a standard data processing pipeline commonly used for FTIR images. This will include:

1. mask generation (optional)
2. baseline correction
3. normalization
4. converting data (optional)
5. principal component analysis
6. dimension reduction
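The whole preprocessing chain can be scripted from Python. The sketch below only collects the commands used in the individual steps of this tutorial and optionally runs them; the function names are hypothetical, and the default file names are the ones used in this section.

```python
import shlex
import subprocess


def preprocessing_commands(mosaic="mosaic_image", mask="mask.bmp",
                           baseline="baseline.txt", components=30):
    """Return the tutorial's preprocessing commands as argument lists."""
    return [
        ["siproc", mosaic, mask, "--threshold", "1650", "0.02", "2"],
        ["siproc", mosaic, "baseline_image", "--baseline", baseline, "--mask", mask],
        ["siproc", "baseline_image", "normalized_image", "--normalize", "1650", "--mask", mask],
        ["siproc", "normalized_image", "normalized_image_bip", "--convert", "bip"],
        ["siproc", "normalized_image_bip", "pca_stats.csv", "--pca", "--mask", mask],
        ["siproc", "normalized_image_bip", "pca_image", "--project", "pca_stats.csv", str(components)],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)   # stop at the first failing step


run_pipeline(preprocessing_commands())   # dry run: prints the six commands
```

Calling `run_pipeline(preprocessing_commands(), dry_run=False)` would execute the steps in order, assuming siproc is installed and on the path.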

At each step, an image of the data set can be collected at a specified band (here 1650 cm⁻¹) using:
siproc mosaic_image image.bmp --image 1650

and a spectrum can be collected at a specified [x, y] pixel (here [126, 96]) using:
siproc mosaic_image spectrum.csv --spectrum 126 96

##### 1. Generate a Mask (optional)

Using a mask for valid pixels is a good way to reduce processing time and improve the accuracy of your algorithms. The provided image uses TMA cores, so many pixels in the regions surrounding the cores do not contain useful data. A mask of the valid pixels can be created by thresholding an image of the Amide I band to values between 0.02 and 2:
siproc mosaic_image mask.bmp --threshold 1650 0.02 2

The application of a mask to various algorithms is not necessary, but is recommended when processing large data sets that contain empty pixels. Since our algorithms are executed out-of-core, this prevents the transfer of unnecessary data from secondary storage (ex. hard disk or network).

##### 2. Baseline Correction

Baseline correction can be performed by specifying a set of baseline points, or locations where the expected absorbance is 0 (zero):
siproc mosaic_image baseline_image --baseline 750 778 816 870 892 ... --mask mask.bmp

These points can be specified in a text file, making them easier to apply to several data sets. An example baseline file is provided in the tutorial data set (above) and can be applied using:
siproc mosaic_image baseline_image --baseline baseline.txt --mask mask.bmp
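To make the idea concrete, the sketch below subtracts a piecewise-linear baseline drawn through the chosen points of a single spectrum. This is an illustration under stated assumptions, not SIproc's implementation: the source does not say how SIproc interpolates between baseline points, `baseline_correct` is a hypothetical name, and the wavenumbers are assumed to be in ascending order.

```python
from bisect import bisect_right


def baseline_correct(wavenumbers, spectrum, baseline_points):
    """Subtract a piecewise-linear baseline through `baseline_points`.
    Each baseline point is snapped to the nearest sampled wavenumber."""
    def nearest(w):
        return min(range(len(wavenumbers)), key=lambda i: abs(wavenumbers[i] - w))

    idx = sorted({nearest(w) for w in baseline_points})
    xs = [wavenumbers[i] for i in idx]   # baseline positions (ascending)
    ys = [spectrum[i] for i in idx]      # measured absorbance at those positions

    corrected = []
    for w, a in zip(wavenumbers, spectrum):
        if w <= xs[0]:
            base = ys[0]                 # flat extension below the first point
        elif w >= xs[-1]:
            base = ys[-1]                # flat extension above the last point
        else:
            j = bisect_right(xs, w)      # xs[j-1] <= w < xs[j]
            t = (w - xs[j - 1]) / (xs[j] - xs[j - 1])
            base = ys[j - 1] + t * (ys[j] - ys[j - 1])
        corrected.append(a - base)
    return corrected


# a sloped baseline (0.1 to 0.3) with a peak of height 0.5 at 850
wavenumbers = [750, 800, 850, 900, 950]
spectrum = [0.1, 0.15, 0.7, 0.25, 0.3]
corrected = baseline_correct(wavenumbers, spectrum, [750, 950])
```

After correction the absorbance at the baseline points is zero and only the peak remains.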

##### 3. Normalization

Normalization can be performed by specifying a band that will be used to ratio the rest of the data set. A common band in tissue images is Amide I at 1650 cm⁻¹:

siproc baseline_image normalized_image --normalize 1650 --mask mask.bmp

Using a mask for normalization is highly recommended, since pixels containing values near zero in the normalization band can result in NaN values in the final image. While our algorithms should handle this data appropriately, you may run into problems when exporting your data to other software packages (ex. ENVI).

##### 4. Conversion (optional)

While the SIproc algorithms use out-of-core techniques to optimally stream the image data from secondary storage (ex. your hard drive), the processing speed for some algorithms can be significantly affected by the data orientation on disk. Most algorithms that treat spectra independently will exhibit the best performance on images in a BIP (bands interleaved by pixel) format. For example, we can significantly speed up the covariance matrix calculation, which is a necessary part of principal component analysis, by converting the image to a BIP format:
siproc normalized_image normalized_image_bip --convert bip
While this is not strictly necessary, SIproc will notify you if conversion is recommended, in which case you can decide whether to convert or continue. Note that if you are using magnetic media for storage, converting to a BIP format can reduce processing time by orders of magnitude.
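The reason BIP helps is visible in how the three standard ENVI interleaves place a value on disk. The sketch below computes the element offset of band b at pixel (x, y) for each layout; the function names are hypothetical, and offsets are counted in elements rather than bytes.

```python
def element_offset(interleave, x, y, b, samples, lines, bands):
    """Offset (in elements) of band b at pixel (x, y) for an ENVI interleave."""
    if interleave == "bsq":    # band sequential: one full image per band
        return (b * lines + y) * samples + x
    if interleave == "bil":    # band interleaved by line
        return (y * bands + b) * samples + x
    if interleave == "bip":    # band interleaved by pixel: spectra are contiguous
        return (y * samples + x) * bands + b
    raise ValueError("unknown interleave: " + interleave)


def spectrum_offsets(interleave, x, y, samples, lines, bands):
    """Offsets of every band of the spectrum at pixel (x, y)."""
    return [element_offset(interleave, x, y, b, samples, lines, bands)
            for b in range(bands)]


# a small 4 x 3 image with 5 bands
bip = spectrum_offsets("bip", 1, 2, samples=4, lines=3, bands=5)
bsq = spectrum_offsets("bsq", 1, 2, samples=4, lines=3, bands=5)
```

In BIP the offsets of one spectrum are consecutive, so it can be read in a single sequential access. In BSQ they are a full image apart, which forces one seek per band.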

##### 5. Principal Component Analysis

One of the most common methods for dimension reduction is principal component analysis (PCA), which identifies a new set of basis vectors that maximize variance in the spectral signal. The principal components can be calculated and stored in a CSV file containing all of the necessary statistical information:

siproc normalized_image_bip pca_stats.csv --pca --mask mask.bmp

The resulting pca_stats.csv contains (1) the mean spectrum and (2) the set of PC basis functions ordered based on variance.
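For readers who want to see what these statistics are, the sketch below computes a mean spectrum and a variance-ordered basis with NumPy on an in-memory array. It is a conceptual illustration only: SIproc computes these out-of-core, the layout of pca_stats.csv is not reproduced here, and `pca_stats` is a hypothetical function name.

```python
import numpy as np


def pca_stats(spectra):
    """spectra: array of shape (n_pixels, n_bands).
    Returns the mean spectrum, the basis vectors as rows ordered by
    decreasing variance, and the corresponding variances."""
    mean = spectra.mean(axis=0)
    cov = np.cov(spectra - mean, rowvar=False)   # (n_bands, n_bands) covariance
    variances, vectors = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(variances)[::-1]          # reorder to descending variance
    return mean, vectors[:, order].T, variances[order]


# synthetic two-band data that varies mostly along the direction [1, 1]
rng = np.random.default_rng(0)
t = rng.normal(size=200)
spectra = np.outer(t, [1.0, 1.0]) + 0.01 * rng.normal(size=(200, 2)) + [5.0, 3.0]
mean, basis, variances = pca_stats(spectra)
```

The first row of `basis` points along the direction of greatest variance, up to sign.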

##### 6. Principal Component Rotation

Finally, the resulting statistics file can be used to rotate the data and extract the corresponding PCA loadings. This step is often used for dimension reduction, so the number of principal components to be used can also be specified:

siproc normalized_image_bip pca_image --project pca_stats.csv 30

In this case we keep 30 principal components, which significantly reduces the size of the data set.
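Conceptually, the projection is a matrix product between each spectrum and the leading basis vectors. The NumPy sketch below shows this for an in-memory array; `project` is a hypothetical name, and subtracting the mean spectrum before projecting is an assumption about the convention rather than something the source states.

```python
import numpy as np


def project(spectra, mean, basis, n_components):
    """Project spectra of shape (n_pixels, n_bands) onto the first
    n_components rows of `basis`, giving (n_pixels, n_components)."""
    return (spectra - mean) @ basis[:n_components].T


# toy example: 3 bands reduced to 2 components with an identity basis
spectra = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0]])
mean = np.array([1.0, 1.0, 1.0])
basis = np.eye(3)
reduced = project(spectra, mean, basis, n_components=2)
```

Each pixel now stores `n_components` values instead of one value per band, which is where the reduction in file size comes from.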

#### Classification

Image classification is performed by generating an annotated set of images that overlay onto your hyperspectral data. There are several software packages that can be used to generate these annotations, including Adobe Photoshop and GIMP. Annotations are provided in the tutorial data as a series of common tissue types: collagen, epithelium, fibroblasts, myofibroblasts, and necrotic tissue:

[Figure: annotation images for each class: Collagen, Epithelium, Fibroblasts, Myofibroblasts, Necrotic]

The siclass executable can be used to create a classifier using the annotated data and hyperspectral image:

siclass pca_image classifier.rf --train class_coll.bmp class_epith.bmp class_fibro.bmp class_myo.bmp class_necrosis.bmp
This will generate a random forest classifier and store the necessary algorithm information in the XML file classifier.rf. Parameters defining the structure of the classifier can also be provided. A full list of options can be displayed using:
siclass --help

Once a classifier has been trained, it can then be applied to other hyperspectral images. For example, we can apply our trained classifier to our original data set:
siclass pca_image class*.bmp --classify classifier.rf --mask mask.bmp
This will generate a set of class images identifying pixels corresponding to each class (similar to the masks used to train the classifier):

[Figure: classified images for each class: Collagen, Epithelium, Fibroblasts, Myofibroblasts, Necrotic]

Alternatively, a single color-coded image can be generated by specifying class colors using the --colors option:
siclass pca_image class_image.bmp --classify classifier.rf --mask mask.bmp --colors red green blue yellow magenta
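The relationship between the per-class images and the color-coded image can be sketched in a few lines of Python. This is an illustration only, not how siclass builds its output: `color_code` and the RGB table are hypothetical, and each class mask is a small grid of booleans.

```python
# RGB values for the color names used in the command above
COLORS = {
    "red": (255, 0, 0),
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
    "magenta": (255, 0, 255),
}


def color_code(class_masks, color_names, background=(0, 0, 0)):
    """Combine one boolean mask per class into a single RGB image.
    Pixels that belong to no class keep the background color."""
    rows, cols = len(class_masks[0]), len(class_masks[0][0])
    image = [[background for _ in range(cols)] for _ in range(rows)]
    for mask, name in zip(class_masks, color_names):
        for r in range(rows):
            for c in range(cols):
                if mask[r][c]:
                    image[r][c] = COLORS[name]
    return image


collagen = [[True, False], [False, False]]
epithelium = [[False, True], [False, False]]
image = color_code([collagen, epithelium], ["red", "green"])
```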

The classification results can be quantified using multiple methods, such as a confusion matrix:
siclass pca_image confusion.csv --validate classifier.rf class_coll.bmp class_epith.bmp class_fibro.bmp class_myo.bmp class_necrosis.bmp

This will produce a CSV file containing the confusion matrix:

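| | Collagen | Epithelium | Fibroblasts | Myofibroblasts | Necrotic |
|---|---|---|---|---|---|
| Collagen | 12829 | 207 | 33 | 13 | 0 |
| Epithelium | 161 | 17548 | 22 | 65 | 21 |
| Fibroblasts | 684 | 21 | 6144 | 544 | 0 |
| Myofibroblasts | 109 | 1224 | 2585 | 3006 | 1 |
| Necrotic | 0 | 70 | 0 | 1 | 7385 |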

Since the Random Forest algorithm has a stochastic component, your matrix results may vary slightly.
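Summary figures can be derived from the confusion matrix with a short script. The sketch below uses the values shown above, hard-coded rather than read from confusion.csv, because the exact CSV layout is not described here. The function names are hypothetical, and since the source does not say which axis holds the annotated classes, the per-row figure is reported simply as the fraction of each row that lies on the diagonal.

```python
confusion = [
    [12829, 207, 33, 13, 0],
    [161, 17548, 22, 65, 21],
    [684, 21, 6144, 544, 0],
    [109, 1224, 2585, 3006, 1],
    [0, 70, 0, 1, 7385],
]


def overall_accuracy(matrix):
    """Fraction of all counted pixels that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def diagonal_fraction_per_row(matrix):
    """For each row, the share of its pixels that fall on the diagonal."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


accuracy = overall_accuracy(confusion)
per_row = diagonal_fraction_per_row(confusion)
```

For the matrix above the overall accuracy is roughly 0.89, with the fourth row (myofibroblasts) showing the most confusion.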