Coarse Semantic Segmentation with Multiscale Dimensionality Reduced Subblocks
https://drive.google.com/file/d/1RIiajKYdxFUbOvZvl42x1kssDJFxO6dv/view?usp=sharing
Motivation
The semantic segmentation task.
The typical goal of semantic segmentation is to assign each pixel in an image to one of a few semantic classes. In the example shown here, each region of the image of a street is labeled as road, sidewalk, car, building, etc. Oftentimes this problem is reduced to finding a binary pixel mask for each of the possible classes (is this pixel part of a car, or not?), then combining these class-wise masks into a single semantic map.
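As a minimal sketch of this mask-combining step (not taken from our code; the class list, image size, and random score masks below are placeholders), the class-wise masks can be merged with an argmax over the class axis:

```python
import numpy as np

# Hypothetical class list and image size; the per-class masks stand in for
# whatever scores a segmentation model would produce.
CLASSES = ["road", "sidewalk", "car", "building"]
h, w = 360, 480
class_masks = np.random.rand(len(CLASSES), h, w)   # one score mask per class

# Combine the class-wise masks: each pixel gets the class with the highest score.
semantic_map = np.argmax(class_masks, axis=0)       # shape (360, 480), values in 0..3
```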
Applications of semantic segmentation.
Semantic Segmentation has a wide
array of applications in many industries, and is an important intermediate step
in creating truly intelligent computer-vision systems. In the context of
self-driving vehicles, knowing where roads, pedestrians, and other objects are
in a single image is a necessary step to determine where these objects are in
3D space. Semantic segmentation is also used in medical image analysis in order
to determine the locations of tumors and other defects.
Unfortunately, generalized semantic segmentation is among the hardest computer vision tasks, because predicting the class of any single pixel requires a high level of contextual knowledge about that pixel and its surroundings. There is almost no way to predict the semantic class of a single pixel using only its RGB values directly: a blue-greyish pixel could be part of a sky, a road, a building, a car, or a jacket. It's this challenge of semantic segmentation, combined with its clear utility, that has us interested in the task.
Current State of the Art
UNET
Our Approach
Motivation
A major challenge of semantic segmentation, or of any other machine learning technique in computer vision, is the curse of dimensionality. For a modestly-sized 480x360 color image, there are 480 * 360 * 3 = 518,400 values to consider (one for each RGB value of each pixel). And for semantic segmentation, the output is of size 480 * 360 = 172,800 per class. Training neural networks on data of this dimensionality requires a vast amount of training data and computational power, which are both expensive.
In short, the goal of our approach is to build a learning-based method for semantic segmentation that uses a "smaller" predictive model than current convolutional methods. This in turn reduces the amount of training data required to train the model, and speeds up the learning process. We achieve this by dramatically reducing the dimensionality of the input to our predictive model.
Multiscale Sliding Windows
One way that we accomplish this is
by creating a model that makes a prediction only about a small "chunk" of an
image. In our case, we predict the semantic composition of only a 32x32 chunk
of a high-resolution image. However, in order to make a prediction about that
small chunk, it helps to also know the visual "context" of the area surrounding
that chunk. This context is captured by using a multiscale "pyramid" of
progressively larger but lower-fidelity chunks surrounding the "core" chunk we
wish to make a prediction on.
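A minimal sketch of this pyramid extraction is shown below; the number of levels, the factor-of-two scaling between levels, and the use of skimage for downscaling are illustrative assumptions rather than our exact implementation.

```python
import numpy as np
from skimage.transform import resize  # used here to downscale the larger context patches

def extract_pyramid(image, cy, cx, core=32, levels=3):
    """Return `levels` patches centered at (cy, cx): the 32x32 core chunk plus
    progressively larger surrounding regions, each resized down to 32x32."""
    patches = []
    for lvl in range(levels):
        half = (core * 2 ** lvl) // 2               # half-width: 16, 32, 64, ...
        region = image[cy - half:cy + half, cx - half:cx + half]
        # Larger regions are downscaled, so they keep context but lose fidelity.
        patches.append(resize(region, (core, core), anti_aliasing=True))
    return patches
```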
In
order to perform semantic segmentation on the whole image, we then slide this
pyramid across the whole image, giving us a semantic prediction of every 32x32
subblock.
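A sketch of this sliding step, reusing the extract_pyramid helper sketched above, might look like the following; the margin, step size, and the `encode` / `model` placeholders are assumptions, not our exact code. The margin keeps the largest pyramid level inside the image, which is also why the corners are currently skipped.

```python
def segment_coarse(image, encode, model, core=32, margin=64):
    """Predict the semantic composition of every 32x32 core block by sliding
    the multiscale pyramid across the image in 32x32 steps."""
    h, w = image.shape[:2]
    predictions = {}
    for y in range(margin, h - margin - core, core):
        for x in range(margin, w - margin - core, core):
            pyramid = extract_pyramid(image, y + core // 2, x + core // 2)
            predictions[(y, x)] = model.predict(encode(pyramid))
    return predictions
```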
Texture Dimensionality Reduction with PCA
As mentioned, image data is
inherently very high-dimensional.
Limiting the prediction inputs to the images of the multiscale pyramid helps
reduce this dimensionality somewhat, but not enough to make a truly small model.
So, we use Principal Component Analysis (PCA) to reduce the input dimensionality even further.
Consider a single 32x32 black & white image. Similar to how we can use the Fourier transform / discrete cosine transform to perform a "basis" transformation, giving us the original 32x32 image as coefficients in a different basis, we can also use PCA to get different basis coefficients for the image. We can then "drop" most of these basis coefficients, obtaining a compact "representation" of the original 32x32 image. If we want to make a prediction about this 32x32 image, we can use, say, the 90 most important PCA coefficients instead of the whole 1024 pixel values.
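A minimal sketch of this texture PCA using scikit-learn (the patch data here is random stand-in data, and 90 components is simply the figure quoted above, not a tuned setting):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in training data: many flattened 32x32 grayscale patches.
patches = np.random.rand(10000, 32 * 32)
texture_pca = PCA(n_components=90).fit(patches)      # learn a 90-dimensional basis

one_patch = np.random.rand(1, 32 * 32)
coeffs = texture_pca.transform(one_patch)             # (1, 90): compact representation
approx = texture_pca.inverse_transform(coeffs)        # (1, 1024): approximate reconstruction
```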
Color Dimensionality Reduction with PCA
Beyond using PCA to reduce the
dimensionality of the "Texture" information of an image sub-block, we can also
use PCA to reduce the dimensionality of the color information of each image.
Images are typically thought of as having red, green, and blue channels. To dimensionality-reduce an image, we could simply apply the PCA texture encoding to the red, green, & blue channels. However, applying PCA to the RGB values of the image's pixels first gives a more efficient representation: we get three new channels, specific to that image, where the first channel is "most important", the second less important than the first, and the last channel least important. For example, this small image of some tree roots with mulch is well approximated using only 2 of the 3 "PCA" channels.
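A minimal sketch of this per-image color PCA (the sub-block here is random stand-in data, and any normalization we apply in practice is omitted):

```python
import numpy as np
from sklearn.decomposition import PCA

image = np.random.rand(32, 32, 3)                     # stand-in for an RGB sub-block
pixels = image.reshape(-1, 3)                         # every pixel as a 3-vector of RGB values

color_pca = PCA(n_components=3).fit(pixels)           # color basis specific to this image
pca_channels = color_pca.transform(pixels).reshape(32, 32, 3)
# Channel 0 carries the most variance and channel 2 the least, so dropping
# the last channel(s) loses the least information.
```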
Instead of applying PCA texture dimensionality reduction to the red, green, and blue channels, we apply it to the 3 custom "PCA" channels. Because we know the channels are ordered by importance, we keep the most texture information for the first channel, and less for the other two channels.
Putting it together: the encoding.
In order to predict the semantic composition of a single "core" block, we encode the visual information of that block, and of the larger blocks in the multiscale pyramid, into a single vector, reducing the dimensionality of the input by a large margin.
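A minimal sketch of assembling this encoding is given below. The helper names, the fitted texture PCAs (one per channel, as sketched earlier), and the 90/30/10 split of coefficients across the three PCA channels are illustrative assumptions, not our exact numbers.

```python
import numpy as np

def encode_block(pyramid, to_pca_channels, texture_pcas, keep=(90, 30, 10)):
    """pyramid: list of 32x32x3 patches from the multiscale pyramid.
    to_pca_channels: per-image color PCA (as sketched above), returning 32x32x3.
    texture_pcas: one fitted texture PCA per channel.
    keep: how many coefficients to retain per channel (most for channel 0)."""
    parts = []
    for patch in pyramid:
        channels = to_pca_channels(patch)
        for c, (pca, k) in enumerate(zip(texture_pcas, keep)):
            coeffs = pca.transform(channels[..., c].reshape(1, -1))
            parts.append(coeffs[0, :k])               # fewer coefficients for channels 1 and 2
    return np.concatenate(parts)                       # one low-dimensional feature vector
```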
Task
Cityscapes Dataset
We use the Cityscapes dataset to classify objects within city scenes, such as cars, foliage, sky, road, and signs. Our inspiration for choosing this dataset is self-driving cars.
Constraints
Our method does not cover the corners of the image for now. Also, we predict only the semantic composition of a core block, i.e. what fraction of its pixels belongs to each class, not which pixels actually belong to which class.
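A minimal sketch of how such a coarse composition target can be computed from a ground-truth label mask (the class count and integer-mask format are assumptions):

```python
import numpy as np

def composition_label(label_block, num_classes):
    """label_block: 32x32 array of integer class ids from the ground-truth mask.
    Returns the fraction of pixels belonging to each class (sums to 1)."""
    counts = np.bincount(label_block.ravel(), minlength=num_classes)
    return counts / label_block.size

example = np.random.randint(0, 5, size=(32, 32))      # stand-in for a 32x32 label patch
print(composition_label(example, num_classes=5))
```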
Results
Model Size
Our model contains a total of 47,998 parameters, which is tiny in comparison to UNET's 2,060,424 parameters.
Training Time
Given this massive difference in the number of parameters, the training times are also vastly different. Our neural network could be trained within 20 minutes while utilizing 30% of a laptop processor, whereas UNET's pixel-wise segmentation model took 400 minutes to train even on a graphics card.
Inference/ Prediction Time
At this moment, it takes 60 seconds to process a full image and predict its labels.
Accuracy
Our 47,998-parameter model has an accuracy of 82.79%. Our 203,630-parameter model has an accuracy of 90.08%, and our 845,982-parameter model has an accuracy of 93.04%. Each model was only trained for 20 minutes.
The pictures below show the 47,998-parameter model in action: the top three images are the actual labels taken from the dataset, and the bottom three are the ones our model predicted.
Successes:
Areas of improvement:
Future Work
Inference Time is slow
Inference is slow, taking 60 seconds to predict all of the labels within a large image.
Corners
We want our method of image segmentation to also work on the corners of the image; at this moment it cannot reach the corners due to the lack of surrounding information available to our current algorithm.
Follow Up with a coarse -> fine step.
We would like to refine our coarse segmentation into a pixel-wise, fine segmentation, so as to produce more detailed output. The coarse segmentation limits our ability to make completely accurate predictions, because it loses fine information that is sometimes important.
References
● The Cityscapes dataset
○ Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., ... & Schiele, B. (2016). The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3213-3223).
● PCA color reduction
○ De Lathauwer, L., & Vandewalle, J. (2004). Dimensionality reduction in higher-order signal processing and rank-(R1, R2, ..., RN) reduction in multilinear algebra. Linear Algebra and its Applications, 391, 31-55.
● CNNs
○ LeCun, Y., Haffner, P., Bottou, L., & Bengio, Y. (1999). Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision (pp. 319-345). Springer, Berlin, Heidelberg.
● UNET
○ Ronneberger, O., Fischer, P., & Brox, T. (2015, October). U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 234-241). Springer, Cham.
● Sliding window, then CNN
○ Vigueras-Guillén, J. P., Sari, B., Goes, S. F., Lemij, H. G., van Rooij, J., Vermeer, K. A., & van Vliet, L. J. (2019). Fully convolutional architecture vs sliding-window CNN for corneal endothelium cell segmentation. BMC Biomedical Engineering, 1(1), 1-16.
Link to GitHub repo:
https://github.com/benthehuman1/UW-Vision-Segmentation