Description
SIproc is a command-line tool focused on out-of-core data processing and classification of hyperspectral images. This software is optimized around algorithms that can be applied to data streamed from secondary storage (e.g., hard drives and NAS). This allows SIproc to perform image processing and segmentation on data sets that are several terabytes in size.
SIproc consists of multiple tools that perform specific functions in the image segmentation chain:
- SIproc - Image preprocessing, noise reduction, basic analysis, and dimension reduction
- SIview - Hyperspectral image visualization and browsing
- SItrain - Classifier training for k-means clustering and random forests
- SIpredict - Classifier prediction for k-means clustering and random forests
- SIstain - Digital staining of hyperspectral images based on ground truth acquired using a different imaging modality. This is based on the work described in Mayerich, et al., Technology, 3(1) 2015
Usage
All applications are command-line driven, and designed to be combined using a scripting language such as Python. A list of valid arguments, as well as usage examples, can be retrieved using the --help argument:
>> siproc --help
The general usage for all applications is:
si**** [input file] [output file] --algorithm [param1 param2 ...] --mask maskfile.bmp
Where [input file] is the initial file containing relevant hyperspectral information, [output file] is the file produced by the desired algorithm, --algorithm is the algorithm to be used for data processing (e.g., baseline correction, normalization, PCA), and --mask lets the user provide a binary (black/white) image specifying the pixels to which the algorithm will be applied.
For example, assume that we wish to normalize a hyperspectral image by taking a ratio to the Amide I protein band (1650 cm⁻¹). Since this result would be undefined for pixels that don't contain tissue, we can create a mask using the unprocessed data:
siproc original_file mask.bmp --build-mask 1650 0.1
This will generate a binary image mask.bmp where:
- all pixels at the 1650 cm⁻¹ band with values above the 0.1 threshold are white
- all pixels at the 1650 cm⁻¹ band with values below the 0.1 threshold are black
We can then normalize only these pixels to create another hyperspectral image:
siproc original_file normalized_file --normalize 1650 --mask mask.bmp
The final output file normalized_file will have normalized spectral values at masked pixels, while unmasked pixels will be set to zero. Limiting an algorithm with a mask generally saves a significant amount of computation time, especially when the file is in BIP format.
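Since the applications are meant to be driven from a scripting language, the two commands above can be assembled in Python. The following is a minimal sketch: build_command and run_all are hypothetical helpers (not part of SIproc), and the only flags used are the ones shown in this section.

```python
import shlex
import subprocess


def build_command(tool, input_file, output_file, algorithm, params=(), mask=None):
    """Assemble an argument list that follows the general usage pattern.

    Hypothetical helper for illustration; it is not part of SIproc.
    """
    cmd = [tool, str(input_file), str(output_file), f"--{algorithm}"]
    cmd += [str(p) for p in params]
    if mask is not None:
        cmd += ["--mask", str(mask)]
    return cmd


def run_all(steps):
    """Run each command in order, stopping at the first failure."""
    for cmd in steps:
        subprocess.run(cmd, check=True)


# The mask-then-normalize example from this section.
steps = [
    build_command("siproc", "original_file", "mask.bmp", "build-mask", [1650, 0.1]),
    build_command("siproc", "original_file", "normalized_file", "normalize",
                  [1650], mask="mask.bmp"),
]

for cmd in steps:
    print(shlex.join(cmd))

# Call run_all(steps) once siproc is installed and on your PATH.
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems when file names contain spaces.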
Compiling
This code has been tested on Ubuntu and Windows. All of the necessary libraries are available on Ubuntu using aptitude. Windows requires manual installation of all libraries. The only source repository required is the STIM-Lib repository, which can be directly downloaded or pulled via Git:
STIM-Lib
- source repository: https://git.stim.ee.uh.edu/codebase/stimlib
- (set the environment variable STIMLIB_PATH to the directory of this repository)
The following libraries are required to build SIproc:
Boost (only required for Linux, use aptitude)
CUDA Toolkit
- https://developer.nvidia.com/cuda-toolkit
- all environment variables should be set automatically when CUDA is installed
GLUT
- Windows: http://www.transmissionzero.co.uk/software/freeglut-devel/
- Windows: set the environment variable GLUT_ROOT_PATH to the directory containing the /lib and /GL directories
- Ubuntu: aptitude will install this correctly
GLEW
- Windows: http://glew.sourceforge.net/
- Windows: set the environment variable GLEW_ROOT to the directory containing the /lib and /GL directories
- Ubuntu: aptitude will install this correctly
LAPACKE/LAPACK/BLAS
- Windows: you really have to compile this yourself (sorry, I recommend the MinGW 64-bit option):
- http://www.netlib.org/lapack/
- set the environment variables LAPACKE_PATH and LAPACK_PATH to the directory containing the lapacke header files and all necessary lib files
- Ubuntu: aptitude will install this correctly
Tutorial
In this section, we will provide an example of hyperspectral image processing from data collection to tissue classification.
Data Acquisition
We first acquire a data set using a mid-infrared imaging system, in this case an Agilent Cary 620 FTIR microscope. These instruments use Resolutions Pro to manage image acquisition. After acquiring the image, you can use the "Export to ENVI" option in Resolutions Pro to create an ENVI header file describing the binary format of the images.
The individual images collected using the Cary system can then be reconstructed into a complete mosaic using:
siproc ./directory/containing/image mosaic_image --mosaic 128
Where the parameter of --mosaic is the size of the focal plane array in the Cary 620 system. This will combine all of the FPA fields produced by the Cary system into a single mosaic_image file and corresponding mosaic_image.hdr header file. Note that this process will vary depending on the imaging instrumentation used. Our software is designed to be used with binary images with ENVI header files. This header format is openly available, if you would like to generate your own.
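If you want to inspect or generate a header yourself, an ENVI header is a plain-text file of "key = value" fields. The sketch below reads those fields into a dictionary. parse_envi_header is a hypothetical helper and the example header values are made up for illustration; the field names (samples, lines, bands, interleave, wavelength) are standard ENVI header fields.

```python
def parse_envi_header(text):
    """Parse 'key = value' fields from ENVI header text into a dict.

    Hypothetical helper for illustration; it is not part of SIproc.
    """
    fields = {}
    lines = iter(text.splitlines())
    for line in lines:
        if "=" not in line:
            continue  # skips the leading "ENVI" line and blank lines
        key, value = line.split("=", 1)
        value = value.strip()
        # Brace-delimited values (e.g. wavelength lists) may span several lines.
        while value.startswith("{") and not value.endswith("}"):
            value += " " + next(lines, "}").strip()
        fields[key.strip().lower()] = value
    return fields


# Made-up example header describing a tiny 128 x 128 image with 3 bands.
example = """ENVI
samples = 128
lines = 128
bands = 3
header offset = 0
data type = 4
interleave = bip
byte order = 0
wavelength = {1648.0,
 1650.0, 1652.0}
"""

hdr = parse_envi_header(example)
shape = (int(hdr["lines"]), int(hdr["samples"]), int(hdr["bands"]))
wavelengths = [float(w) for w in hdr["wavelength"].strip("{}").split(",")]
print(shape, hdr["interleave"], wavelengths)
```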
Data Download (optional)
Since you may not have access to an FTIR imaging system or viable data, this tutorial can be completed by using a breast biopsy data set that we have uploaded for this purpose. This data set is an image of four breast biopsy cores from the BRC961 tissue microarray (TMA) acquired from Biomax Inc. This data set is approximately 2.2GB in size and can be downloaded using BitTorrent Sync with the sync key: BKQBA3B5MBL2F4GFADUJYVLYYSUSZ7NOR
Data Preprocessing
We will now describe a standard data processing pipeline commonly used for FTIR images. This will include:
- generating a mask (optional)
- baseline correction
- normalization
- converting data (optional)
- principal component analysis
- dimension reduction
At each step, an image of the data set can be collected at a specified band (here 1650 cm⁻¹) using:
siproc mosaic_image image.bmp --image 1650
and a spectrum can be collected at a specified [x, y] pixel (here [126, 96]) using:
siproc mosaic_image spectrum.csv --spectrum 126 96
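To make the two outputs concrete, here is an in-memory NumPy sketch of what a band image and a pixel spectrum are. It is only an illustration: it assumes the cube is stored as (lines, samples, bands), that x indexes samples and y indexes lines, and that the band closest to the requested wavenumber is used. How SIproc selects or interpolates bands internally is not described here.

```python
import numpy as np


def nearest_band(wavenumbers, target):
    """Index of the band whose wavenumber is closest to target."""
    return int(np.argmin(np.abs(np.asarray(wavenumbers) - target)))


# Tiny made-up cube: 4 lines x 5 samples x 3 bands.
wavenumbers = np.array([1600.0, 1650.0, 1700.0])
cube = np.arange(4 * 5 * 3, dtype=np.float32).reshape(4, 5, 3)

# A band image is one 2D slice of the cube (compare --image 1650).
band_image = cube[:, :, nearest_band(wavenumbers, 1650)]

# A spectrum is the full band vector at one pixel (compare --spectrum x y).
x, y = 1, 2
spectrum = cube[y, x, :]

print(band_image.shape, spectrum)
```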
1. Generate a Mask (optional)
Using a mask for valid pixels is a good way to improve processing time and improve the accuracy of your algorithms. The provided image uses TMA cores, so there are several pixels in the regions surrounding the cores that do not contain useful data. A mask of the valid pixels can be created by thresholding an image of the Amide I band to values between 0.02 and 2:
siproc mosaic_image mask.bmp --threshold 1650 0.02 2
Applying a mask is not required, but it is recommended when processing large data sets that contain empty pixels. Since our algorithms are executed out-of-core, this prevents the transfer of unnecessary data from secondary storage (e.g., hard disk or network).
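The thresholding idea can be sketched in a few lines of NumPy. This is an in-memory illustration only; threshold_mask is a hypothetical name, and treating both bounds as inclusive is an assumption, since the exact comparison SIproc uses is not stated.

```python
import numpy as np


def threshold_mask(band_image, low, high):
    """True where low <= value <= high (inclusive bounds are an assumption)."""
    return (band_image >= low) & (band_image <= high)


# Made-up Amide I band image: background near zero, one saturated pixel.
band_image = np.array([[0.00, 0.01, 0.50],
                       [1.20, 2.50, 0.02]])

mask = threshold_mask(band_image, 0.02, 2.0)
print(mask)
print("valid pixels:", int(mask.sum()), "of", mask.size)
```

Only the pixels marked True need to be read and processed in later steps, which is where the time savings come from.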
2. Baseline Correction
Baseline correction can be performed by specifying a set of baseline points, or locations where the expected absorbance is zero:
siproc mosaic_image baseline_image --baseline 750 778 816 870 892 ... --mask mask.bmp
These points can be specified in a text file, making them easier to apply to several data sets. An example baseline file is provided in the tutorial data set (above) and can be applied using:
siproc mosaic_image baseline_image --baseline baseline.txt --mask mask.bmp
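As an illustration of what baseline points do, the sketch below subtracts a piecewise-linear baseline drawn through the spectrum at the given points. The piecewise-linear form is an assumption made for this example (the interpolation SIproc uses is not described here), baseline_correct is a hypothetical name, and the wavenumbers are assumed to be sorted in increasing order.

```python
import numpy as np


def baseline_correct(wavenumbers, spectrum, baseline_points):
    """Subtract a piecewise-linear baseline through the given points.

    Assumes wavenumbers are increasing. Illustrative only; not SIproc's code.
    """
    wavenumbers = np.asarray(wavenumbers, dtype=float)
    spectrum = np.asarray(spectrum, dtype=float)
    points = np.sort(np.asarray(baseline_points, dtype=float))
    # Absorbance measured at each baseline point.
    anchors = np.interp(points, wavenumbers, spectrum)
    # Straight lines between anchors form the baseline.
    baseline = np.interp(wavenumbers, points, anchors)
    return spectrum - baseline


# Made-up spectrum: a linear drift plus a single peak at 850.
wavenumbers = np.linspace(750, 900, 7)           # 750, 775, ..., 900
drift = 0.05 + 0.001 * (wavenumbers - 750)
peak = np.where(wavenumbers == 850, 0.4, 0.0)
spectrum = drift + peak

corrected = baseline_correct(wavenumbers, spectrum, [750, 800, 900])
print(np.round(corrected, 3))
```

After correction the drift is gone and only the peak remains, while the values at the baseline points themselves are zero.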
3. Normalization
Normalization can be performed by specifying a band that will be used to ratio the rest of the data set. A common band in tissue images is Amide I at 1650 cm⁻¹:
siproc baseline_image normalized_image --normalize 1650 --mask mask.bmp
Using a mask for normalization is highly recommended, since pixels containing values near zero in the normalization band can result in NaN values in the final image. While our algorithms should handle this data appropriately, you may run into problems when exporting your data to other software packages (e.g., ENVI).
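The effect of the mask on band-ratio normalization can be seen in a small in-memory sketch. normalize_to_band is a hypothetical name, the cube layout (lines, samples, bands) is an assumption, and this is not SIproc's out-of-core implementation.

```python
import numpy as np


def normalize_to_band(cube, band_index, mask):
    """Divide each masked spectrum by its value at band_index; zero the rest."""
    out = np.zeros_like(cube, dtype=float)
    reference = cube[mask, band_index]           # one value per masked pixel
    out[mask] = cube[mask] / reference[:, None]
    return out


# Made-up 1 x 2 image with 3 bands: one tissue pixel, one empty pixel.
cube = np.array([[[1.0, 2.0, 4.0],
                  [0.0, 0.0, 0.0]]])
mask = np.array([[True, False]])

normalized = normalize_to_band(cube, band_index=1, mask=mask)
print(normalized)
```

Without the mask, the empty pixel would be computed as 0/0 and produce NaN values; with the mask it is skipped and left at zero.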
4. Conversion (optional)
While the SIproc algorithms use out-of-core techniques to optimally stream the image data from secondary storage (e.g., your hard drive), the processing speed for some algorithms can be significantly affected by the data orientation on disk. Most algorithms that treat spectra independently will exhibit the best performance on images in a BIP (bands interleaved by pixel) format. For example, we can significantly speed up the covariance matrix calculation, which is a necessary part of principal component analysis, by converting the image to a BIP format:
siproc normalized_image normalized_image_bip --convert bip
While this is not strictly necessary, SIproc will notify you if conversion is recommended, in which case you can decide whether to convert or continue. Note that if you are using magnetic media for storage, converting to a BIP format can result in orders of magnitude reduction in processing time.
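The reason BIP helps per-spectrum algorithms can be shown with array strides. The sketch below assumes the source image is band-sequential (BSQ), which is one of the ENVI interleave options; your file may start in a different layout. The array sizes are made up.

```python
import numpy as np

bands, lines, samples = 3, 2, 4

# BSQ: each band is stored as a complete image, one after another.
bsq = np.arange(bands * lines * samples).reshape(bands, lines, samples)

# BIP: all bands of one pixel are stored next to each other.
bip = np.ascontiguousarray(np.transpose(bsq, (1, 2, 0)))

# Distance (in elements) between consecutive bands of the same pixel.
stride_bsq = bsq.strides[0] // bsq.itemsize   # lines * samples
stride_bip = bip.strides[2] // bip.itemsize   # 1, i.e. contiguous

print("BSQ band stride:", stride_bsq)
print("BIP band stride:", stride_bip)
print("same spectrum:", np.array_equal(bip[0, 1], bsq[:, 0, 1]))
```

In BSQ, reading one spectrum touches one value in every band image, so the reads are scattered across the file. In BIP the same spectrum is a single contiguous read, which matters most on storage with high seek cost.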
5. Principal Component Analysis
One of the most common methods for dimension reduction is principal component analysis (PCA), which identifies a new set of basis vectors that maximize variance in the spectral signal. The principal components can be calculated and stored in a CSV file containing all of the necessary statistical information:
siproc normalized_image_bip pca_stats.csv --pca --mask mask.bmp
The resulting pca_stats.csv contains (1) the mean spectrum and (2) the set of PC basis functions ordered by variance.
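For readers who want to see what these statistics are, here is an in-memory NumPy sketch that computes a mean spectrum and variance-ordered basis vectors from a set of spectra. It is a textbook PCA on made-up data, not SIproc's out-of-core implementation, and pca_stats is a hypothetical name.

```python
import numpy as np


def pca_stats(spectra):
    """Return (mean, components, variances) for spectra of shape (pixels, bands).

    components holds one basis vector per row, ordered by decreasing variance.
    """
    mean = spectra.mean(axis=0)
    cov = np.cov(spectra, rowvar=False)            # bands x bands covariance
    variances, vectors = np.linalg.eigh(cov)       # eigh returns ascending order
    order = np.argsort(variances)[::-1]
    return mean, vectors[:, order].T, variances[order]


# Made-up data: 500 "pixels" with 4 bands of very different variance.
rng = np.random.default_rng(0)
spectra = rng.normal(size=(500, 4)) * np.array([5.0, 2.0, 1.0, 0.1])

mean, components, variances = pca_stats(spectra)
print("variances:", np.round(variances, 3))
```

The covariance matrix is where the data layout matters: it needs every band of every masked pixel, which is why the BIP conversion in the previous step speeds it up.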
6. Principal Component Rotation
Finally, the resulting statistics file can be used to rotate the data and extract the corresponding PCA loadings. This step is often used for dimension reduction, so the number of principal components to be used can also be specified:
siproc normalized_image_bip pca_image --project pca_stats.csv 30
In this case we keep 30 principal components, which significantly reduces the size of the data set.
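The rotation itself is a mean subtraction followed by a matrix product with the retained basis vectors. The sketch below uses a tiny made-up basis so the result is easy to check by hand; project is a hypothetical name and the inputs stand in for the mean and basis stored in the statistics file.

```python
import numpy as np


def project(spectra, mean, components, n_keep):
    """Center the spectra and project onto the first n_keep basis vectors."""
    return (spectra - mean) @ components[:n_keep].T


# Made-up statistics for 3 bands: basis rows are e2, e0, e1 (already ordered).
mean = np.array([1.0, 1.0, 1.0])
components = np.eye(3)[[2, 0, 1]]

spectra = np.array([[2.0, 3.0, 4.0],
                    [1.0, 1.0, 1.0]])

reduced = project(spectra, mean, components, n_keep=2)
print(reduced)
```

Each pixel goes from one value per band to one value per retained component, so the output size scales with the number of components kept rather than the number of bands.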
Classification
Image classification is performed by generating an annotated set of images that overlay onto your hyperspectral data. There are several software packages that can be used to generate these annotations, including Adobe Photoshop and GIMP. Annotations are provided in the tutorial data as a series of common tissue types: collagen, epithelium, fibroblasts, myofibroblasts, and necrotic tissue:
[Annotation images, one per class: Collagen | Epithelium | Fibroblasts | Myofibroblasts | Necrotic]
The siclass executable can be used to create a classifier using the annotated data and hyperspectral image:
siclass pca_image classifier.rf --train class_coll.bmp class_epith.bmp class_fibro.bmp class_myo.bmp class_necrosis.bmp
This will generate a random forest classifier and store the necessary algorithm information in the XML file classifier.rf. Parameters defining the structure of the classifier can also be provided. A full list of options can be displayed using:
siclass --help
Once a classifier has been trained, it can then be applied to other hyperspectral images. For example, we can apply our trained classifier to our original data set:
siclass pca_image class*.bmp --classify classifier.rf --mask mask.bmp
This will generate a set of class images identifying pixels corresponding to each class (similar to the masks used to train the classifier):
[Classification result images, one per class: Collagen | Epithelium | Fibroblasts | Myofibroblasts | Necrotic]
Alternatively, a single color-coded image can be generated by specifying class colors using the --colors argument:
siclass pca_image class_image.bmp --classify classifier.rf --mask mask.bmp --colors red green blue yellow magenta
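Conceptually, the color-coded image paints each pixel with the color of its class. The following NumPy sketch shows that idea for per-class boolean masks. color_code and the RGB values in COLORS are assumptions for illustration; the exact colors and overlap handling used by siclass are not described here.

```python
import numpy as np

# Assumed RGB values for the color names used in the command above.
COLORS = {
    "red": (255, 0, 0),
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
    "magenta": (255, 0, 255),
}


def color_code(class_masks, color_names):
    """Combine per-class boolean masks into one RGB image (black = no class)."""
    height, width = class_masks[0].shape
    image = np.zeros((height, width, 3), dtype=np.uint8)
    for mask, name in zip(class_masks, color_names):
        image[mask] = COLORS[name]
    return image


# Made-up 1 x 3 image: first pixel class 0, second class 1, third unclassified.
masks = [np.array([[True, False, False]]),
         np.array([[False, True, False]])]

image = color_code(masks, ["red", "green"])
print(image)
```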
The classification results can be quantified using multiple methods, such as a confusion matrix:
siclass pca_image confusion.csv --validate classifier.rf class_coll.bmp class_epith.bmp class_fibro.bmp class_myo.bmp class_necrosis.bmp
This will produce a CSV file containing the confusion matrix:
12829 | 207 | 33 | 13 | 0 |
161 | 17548 | 22 | 65 | 21 |
684 | 21 | 6144 | 544 | 0 |
109 | 1224 | 2585 | 3006 | 1 |
0 | 70 | 0 | 1 | 7385 |
Since the Random Forest algorithm has a stochastic component, your matrix results may vary slightly.
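A confusion matrix like the one above can be summarized with a few lines of NumPy. The sketch below computes the overall accuracy, which does not depend on whether rows are true or predicted classes, and the diagonal fraction of each row. Reading the rows in the class order collagen, epithelium, fibroblasts, myofibroblasts, necrotic is an assumption based on the order of the annotation files.

```python
import numpy as np

# The confusion matrix reported above.
confusion = np.array([
    [12829,   207,   33,   13,    0],
    [  161, 17548,   22,   65,   21],
    [  684,    21, 6144,  544,    0],
    [  109,  1224, 2585, 3006,    1],
    [    0,    70,    0,    1, 7385],
])

correct = int(np.trace(confusion))      # pixels on the diagonal
total = int(confusion.sum())            # all validated pixels
overall_accuracy = correct / total

# Fraction of each row that falls on the diagonal.
per_row = np.diag(confusion) / confusion.sum(axis=1)

print(f"overall accuracy: {overall_accuracy:.4f}")
print("per-row diagonal fraction:", np.round(per_row, 3))
```

For this matrix, 46912 of 52673 pixels lie on the diagonal (about 89%). The fourth row is the weakest, with 3006 of its 6925 pixels on the diagonal and most of the remainder in the second and third columns.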