
PassionatePossum t1_iu3gete wrote

I have done it before and it works well. But I guess it depends on the use-case. It is a classic technique in computer vision to cluster SIFT vectors (128 dimensions) on a training dataset. You then describe any image as a set of "visual words" (i.e. the IDs of the clusters its SIFT vectors fall into).

A colleague of mine wrote the clustering algorithm himself. It was just a normal k-means with the nearest neighbor search replaced by an approximate nearest neighbor search to speed things up.
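Not my colleague's actual implementation, but a minimal sketch of the idea with OpenCV and scikit-learn (paths and vocabulary size are placeholders; note that scikit-learn's k-means uses exact rather than approximate nearest-neighbor assignment):

```python
# Bag-of-visual-words sketch: cluster SIFT descriptors, then describe each
# image as a histogram over the resulting cluster (visual word) IDs.
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

sift = cv2.SIFT_create()

def sift_descriptors(image_path):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 128), dtype=np.float32)

# 1) Build the vocabulary by clustering descriptors from a training set.
train_paths = ["train/img_000.jpg", "train/img_001.jpg"]  # placeholder paths
all_desc = np.vstack([sift_descriptors(p) for p in train_paths])
vocab = MiniBatchKMeans(n_clusters=1000, random_state=0).fit(all_desc)

# 2) Describe any image as a normalized histogram over visual-word IDs.
def bag_of_words(image_path):
    desc = sift_descriptors(image_path)
    if len(desc) == 0:
        return np.zeros(vocab.n_clusters, dtype=np.float32)
    words = vocab.predict(desc)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(np.float32)
    return hist / hist.sum()
```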

1

PassionatePossum t1_it9djl0 wrote

I understand if you are working on an academic paper or something like that; in that case, novelty is important. If you are working in industry - as I currently am - I have no such concerns. In industry, the skill is to produce a working solution fast, and if someone has already built a framework that I am allowed to use, even better.

1

PassionatePossum t1_it6ms0v wrote

>the TPR and FPR are both fractions, so it won't matter if one is a larger class than the other.

In most cases that is a desirable property. You don't want to have excellent results just because one class makes up 99% of your dataset and the classifier just predicts the most common class without learning anything. Precision and Recall are also fractions.

The difference between ROC and Precision/Recall is that ROC needs the concept of a "negative class". That can be problematic for multi-class problems. Even if your data is perfectly balanced across all of your classes, the negative class (i.e. all classes that aren't the class you are examining) is bound to be overrepresented.

Since a precision/recall plot never counts the true negatives (recall only looks at the positive examples, and precision only at what was predicted positive), you don't have that problem.

So, I don't have a problem with the statement that ROC is appropriate for a balanced dataset (provided that we have a binary classification problem, or at least a low number of classes).
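To illustrate with made-up numbers: on a heavily imbalanced binary problem, the ROC AUC can look flattering while the average precision (the area under the precision/recall curve) stays sobering. A quick scikit-learn sketch with synthetic data:

```python
# Synthetic illustration: 99%/1% class imbalance, simple logistic regression.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("ROC AUC:          ", roc_auc_score(y_te, scores))             # often looks great
print("Average precision:", average_precision_score(y_te, scores))   # often much lower
```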

23

PassionatePossum t1_it3rd0p wrote

If the classes are strongly imbalanced, I would probably go for a precision/recall plot, one per class. Overall performance can be compared via the mean average precision.
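As a hypothetical sketch of what I mean (scikit-learn, random placeholder labels and scores, one-vs-rest per class):

```python
# Per-class average precision and the resulting mAP; data is a random stand-in.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n_classes = 4
y_true = rng.integers(0, n_classes, size=500)    # placeholder ground-truth labels
y_score = rng.random((500, n_classes))           # placeholder per-class scores

aps = [average_precision_score((y_true == c).astype(int), y_score[:, c])
       for c in range(n_classes)]
print("per-class AP:", aps)
print("mAP:", float(np.mean(aps)))
```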

You are right: in an overfitting classifier, training accuracy should go up over the long term, but that does not have to be a strong effect. I've seen plenty of overfitting classifiers where the training loss was essentially flat while the validation loss kept increasing. Still, from what you told me, that makes my theory of overfitting slightly less likely.

Your explanation of the 128 units makes a lot more sense. However, I would argue for starting simple: one dense layer after a sufficiently deep convolutional network should be all that is needed.

I feel like your quest for "understanding" network structures is an unproductive direction. Well-performing network architectures are mostly just something that empirically works; there is no real theory behind them. You can waste a lot of time tweaking, or you can just stick with something that has been shown to work across a wide range of problem domains. Especially if you only need a ballpark estimate.

My setup for a ballpark estimate for pretty much any problem is:

  1. An EfficientNet as the backbone. That has the advantage that you can easily scale up the backbone if you have the resources and want to see what is possible with a larger network. I usually start with EfficientNet-B1.
  2. Pretrained ImageNet weights (without the densely connected layers).
  3. Global average pooling on the features.
  4. A single dense layer to the output neurons.
  5. I usually train only the last layer for a single epoch and then release the weights of the backbone (a rough sketch follows this list).
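Roughly, in Keras, that setup looks something like this (input size, number of classes, learning rates, and epochs are placeholders, not a verbatim recipe):

```python
# Rough sketch of the recipe above; all numbers are placeholders.
import tensorflow as tf

NUM_CLASSES = 17  # placeholder

inputs = tf.keras.Input(shape=(240, 240, 3))
backbone = tf.keras.applications.EfficientNetB1(include_top=False, weights="imagenet")
x = tf.keras.layers.GlobalAveragePooling2D()(backbone(inputs))
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

# Step 5a: train only the classification head for one epoch.
backbone.trainable = False
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=1)

# Step 5b: release the backbone weights and continue training.
backbone.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=50)
```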

After I have the initial predictions, I try to visualize the error cases to see whether I can spot commonalities, and work my way up from there.

That hasn't failed me so far. I usually use a focal loss to guard against strongly imbalanced classes. Unfortunately, the multi-class case isn't implemented in TensorFlow (which is what I tend to use), but it is easily implemented in a few lines of code.
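Those few lines can look something like this (a sketch of the standard formulation from the focal loss paper, with the usual default gamma and alpha; not battle-tested production code):

```python
# Multi-class focal loss sketch for Keras/TensorFlow.
import tensorflow as tf

def categorical_focal_loss(gamma=2.0, alpha=0.25):
    def loss(y_true, y_pred):
        # y_true: one-hot targets, y_pred: softmax probabilities.
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        cross_entropy = -y_true * tf.math.log(y_pred)
        weight = alpha * tf.pow(1.0 - y_pred, gamma)  # down-weights easy examples
        return tf.reduce_sum(weight * cross_entropy, axis=-1)
    return loss

# Usage: model.compile(optimizer="adam", loss=categorical_focal_loss())
```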

But in your case I wouldn't go through the trouble of tweaking the loss. A normal cross-entropy loss should be sufficient to get an idea of what is possible. If all else fails, downweight the loss on examples that are overrepresented.

1

PassionatePossum t1_it1jsw2 wrote

Too little information to go on. I hope you have a training and an independent validation set (and by independent I don't mean different images of the same blood cell).

  1. Accuracy can be a highly misleading metric, especially if you have strong imbalances in the number of examples per class.
  2. Increasing validation error when adding layers can be a sign of overfitting. However, don't train for a fixed number of epochs and then evaluate. Validate regularly during training and take the best checkpoint (see the sketch after this list).
  3. "the learning rate is crazy small" sets off alarm bells. You are aware that the learning rate is a parameter you need to set, right?
  4. You have a CNN but you also have a dense layer with 128 units while only having 17 classes. Something does not add up here.
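Regarding point 2, one way to validate regularly and keep the best checkpoint in Keras looks roughly like this (the model and data below are tiny synthetic stand-ins):

```python
# Validate every epoch, save the best checkpoint, and stop when validation
# loss no longer improves. Model and data are placeholders.
import numpy as np
import tensorflow as tf

x = np.random.rand(1000, 32).astype("float32")   # placeholder features
y = np.random.randint(0, 17, size=1000)          # placeholder labels (17 classes)

inputs = tf.keras.Input(shape=(32,))
hidden = tf.keras.layers.Dense(64, activation="relu")(inputs)
outputs = tf.keras.layers.Dense(17, activation="softmax")(hidden)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

callbacks = [
    tf.keras.callbacks.ModelCheckpoint("best_model.keras", monitor="val_loss",
                                       save_best_only=True),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
]
model.fit(x, y, validation_split=0.2, epochs=100, callbacks=callbacks)
```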

As for the number of layers: there is no definite answer to this question, and it is also not that important. You might not get the best performance if you don't optimize it, but it should always sort of work. The problems you have are likely much more fundamental than that.

Classification of images is a well-studied problem. Why not start from an existing, pretrained network such as EfficientNet and build your own classifier on top of it?

5

PassionatePossum t1_irvbp94 wrote

If you are using CNNs, this is actually very straightforward to solve: you need a separate loss function for every independent attribute, and the optimization objective is to minimize the (weighted) sum of these loss functions.

  1. Add a separate output layer for every independent attribute and a separate loss function to every output layer.
  2. During training, set the target values for the unneeded outputs to an arbitrary value and set their loss weights to zero (a sketch follows this list).
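In Keras, that setup can look roughly like this ("color" and "shape" are made-up attributes; the per-example zeroing would be done via per-output sample weights in fit()):

```python
# Minimal multi-head sketch: one output layer and one loss per attribute,
# combined as a weighted sum. All names and sizes are placeholders.
import tensorflow as tf

inputs = tf.keras.Input(shape=(64, 64, 3))
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(inputs)   # shared trunk
x = tf.keras.layers.GlobalAveragePooling2D()(x)

color = tf.keras.layers.Dense(5, activation="softmax", name="color")(x)
shape = tf.keras.layers.Dense(3, activation="softmax", name="shape")(x)

model = tf.keras.Model(inputs, [color, shape])
model.compile(
    optimizer="adam",
    # One loss per output; the total objective is their weighted sum.
    loss={"color": "categorical_crossentropy", "shape": "categorical_crossentropy"},
    loss_weights={"color": 1.0, "shape": 1.0},
)
# For examples where an attribute is not annotated, pass a per-output
# sample_weight of 0.0 in model.fit() so that head contributes no gradient.
```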
1

PassionatePossum t1_irmr78w wrote

You seem to be quite new at this (no offense, but otherwise you wouldn't be asking for code for such a trivial task), so I would like to give you some advice on how to do this right. Others have already told you how to implement a random split, which generally is good advice. However, the underlying assumption is that the images themselves are not somehow correlated with one another.

I've actually seen people take video frames (where, of course, every frame doesn't look much different from the previous one), randomly sample these frames into training/test sets, and then brag about their incredibly good performance. Of course, any performance measurement you do on such a dataset will be worthless.

So how you sample training/test data is something you should think about carefully (i.e. are the training/validation/test sets actually independent of one another?).

So, under the assumption that the images are independent of one another, a random split is a good idea. If that isn't the case (and without more information, nobody here can tell you whether it is), you need some other way to split the data (e.g. by video).
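For example, splitting by video could look like this (a scikit-learn sketch; the frame names and video IDs are made up):

```python
# Split by video instead of by frame, so correlated frames never end up on
# both sides of the train/test boundary. Data is purely illustrative.
from sklearn.model_selection import GroupShuffleSplit

frames = [f"video{v}_frame{i}.jpg" for v in range(5) for i in range(10)]
video_ids = [v for v in range(5) for _ in range(10)]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(frames, groups=video_ids))

train_frames = [frames[i] for i in train_idx]
test_frames = [frames[i] for i in test_idx]
# Every frame of a given video lands entirely in train or entirely in test.
```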

3

PassionatePossum t1_iqvz2n0 wrote

Yeah, this has no chance of working. Neural networks aren't magic. They are function approximators, nothing more. And a neuron can only learn a linear combination of its inputs.

Since you only have one input, the first layer will only be able to learn fractions of the original input. And the second layer will learn how to add them together. So some non-linearities (due to activations) aside, your model can essentially only learn to add fractions of the original input.

And while the universal approximation theorem says that theoretically this is enough to approximate any function if you make your network wide or deep enough, you have no guarantees that the solver will actually find the solution. And in practice, it won't.

A common trick is to use (1, x, x^2, ..., x^n) as input, but I doubt that this will do the trick in your case: if there is a function that describes the relationship between your input variable and the output variable, it would have to be a polynomial of extremely high degree.
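For what it's worth, the trick itself looks something like this (a toy sketch; the degree and the target function are made up):

```python
# Feed powers of x instead of the raw scalar, so even a single dense layer
# can represent a degree-n polynomial. Everything here is a placeholder.
import numpy as np
import tensorflow as tf

def power_features(x, degree=5):
    # shape (num_samples,) -> shape (num_samples, degree)
    return np.stack([x ** k for k in range(1, degree + 1)], axis=1)

x = np.linspace(-1.0, 1.0, 1000).astype("float32")
y = np.sin(3.0 * x)  # placeholder target

inputs = tf.keras.Input(shape=(5,))
outputs = tf.keras.layers.Dense(1)(inputs)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
model.fit(power_features(x), y, epochs=100, verbose=0)
```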

If you have additional inputs you could use, it might help. But just looking at what you have provided, it is not going to work.

2

PassionatePossum t1_iqvklvt wrote

Sorry, I still need a little more information. From the plots you have provided I would assume that you have a regression problem and you have a 1D input and a 1D target, correct? Or are we talking about a time series?

For the moment I'll go with the first assumption. The data you provided looks fairly random, and I'm curious what function you want to use to model it. What does the network look like (how many layers), and what exactly are the inputs to your network (powers of your input variable or something else)?

1

PassionatePossum t1_iqldif6 wrote

I cannot speak for NeurIPS in particular. But most academic conferences are just there to know what is out there, not to gain a deep understanding of the topic. Every session (keynotes aside) is very rushed. Each speaker maybe gets 10 minutes to present their paper. That is usually enough to get a general idea of what the paper is about, but not nearly enough to really understand it. I usually just sit in these sessions and make a note that this might be an interesting paper to read later.

And don't expect workshops to be a step-by-step introduction to the subject. You still need a good general understanding of the subject to benefit from one. I've also been at conferences where the "workshop" wasn't intended to be a workshop for the participants but for the authors (who were usually new researchers). Those were papers with interesting approaches but not-so-good results, and the audience was encouraged to contribute ideas that the author could try.

Just to be sure: I'm not saying that it wouldn't be worth attending such a conference. I'm just saying that you should have the right expectations.

2