# data-preprocessing questions

157 data-preprocessing questions.

### 5 How to prepare data for input to a sparse categorical cross-entropy multiclass classification model

I have a set of tweets with several columns, such as Date and the Tweet itself, but I want to use only two columns to build my model (Sentiment and Stock Price). Sentiment analysis is performed ...

### Balancing dataset and normalizing features: what comes first?

I have an unbalanced data set that I wish to balance for training. Which is better practice: first balancing the training data and then scaling/normalizing the features, or using the mean and ...

### 2 Modelling nonnegative integer time series

1 answer, 75 views time-series arima data-preprocessing

### None Values or Missing Data

Suppose I have a feature "Pool Size" whose levels are Big, Medium, and Small. If a house doesn't have a pool, it won't have a value for "Pool Size". Q1: For this particular instance/...

### 1 Pre-process geographical data for Machine Learning

I'm working with a dataset containing crime data from Chicago. There's a lot of geographical data, and I'm looking for advice on pre-processing. We have qualitative variables represented by integers,...

### 1 Should classes be balanced before or after splitting into sets?

I've split my data and performed pre-processing. I ran some basic classifiers on it and got accuracies within 70-80%, which to me seems fairly low. One thing I didn't do was balance my classes before ...

### When is clipping/winsorizing a good idea?

0 answers, 70 views data-preprocessing winsorizing

I have a dataset that I will use for training a logistic regression model with regularization, where one of the features can sometimes take on extreme values. The feature describes a ratio with most ...

### 2 features preprocessing in model building

I have read many examples of "standardization", but some opinions conflict with them; e.g. some cases add lag features, and some of these features are created from other original features, and it ...

### 2 Should I remove any outliers before splitting the data?

1 answer, 107 views outliers data-preprocessing

I've split my data into three sets before doing any pre-processing: training, validation and testing. I thought that any pre-processing tasks have to take place after splitting the data. However, some ...

### Should I apply PCA on my entire dataset or just the data converted from nominal?

1 answer, 78 views pca python data-preprocessing

I have a dataset with a decent number of attributes, half of them nominal. I used a binary vectorizer to convert the nominal attributes to numerical, but now there are far too many of them. I'm not ...

### How to create training and testing sets for data from two sources?

0 answers, 16 views machine-learning data-preprocessing

I have two gene expression data sets from two different labs. One contains about 2000 samples, the other about 100 samples. I believe the labs obtained the data using different biotechnology methods. Now ...

### Pre-Processing - Applied on all three (training/validation/test) sets?

From what I understand from previously answered questions, you're meant to do your pre-processing on each set after splitting your data into training and test sets. But I'm not sure where the ...

### 1 How can I standardize/normalize my categorical, factorized features in an outlier detection problem?

I'm working on anomaly detection in the CTU-13 dataset. Records are labeled and there are a few categorical features with many categories (for example one of the features "State" has over 250 possible, ...

### 1 Preparing data to apply machine learning algorithms for time series

I have daily time series financial data. I want to apply machine learning techniques to predict expected returns. To do this, I have first transformed the data so that I could take into account time ...

### On what basis can we combine levels in a factor variable when the target variable is binary? [duplicate]

I am working on a dataset in which a variable has the following levels:

| Level | 0 | 1 | 2 | 3 | 4 | 5 | 8 |
|---|---|---|---|---|---|---|---|
| Frequency | 608 | 209 | 28 | 16 | 18 | 5 | 7 |

The target variable is binary. ...

### Is there a difference between normalizing vs. percentages when input data represents counts?

For example, say I have an input vector [0, 10, 0, 10, 20] representing 0 of item 1, 10 of item 2, etc. If I want to train on this data, is there some intuitive ...
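
To make the two options in this question concrete, here is a minimal sketch applied to the count vector from the excerpt. The excerpt does not say which kind of "normalizing" is meant, so L2 (unit-length) normalization is assumed here purely for illustration; the function names are hypothetical.

```python
import math

# Count vector taken from the question excerpt.
counts = [0, 10, 0, 10, 20]


def to_percentages(values):
    """Rescale counts so that they sum to 100."""
    total = sum(values)
    return [100.0 * v / total for v in values]


def l2_normalize(values):
    """Rescale counts so that the vector has Euclidean length 1.

    Assumption: "normalizing" in the question means L2 normalization.
    """
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]


percentages = to_percentages(counts)
unit_vector = l2_normalize(counts)

print(percentages)
print(unit_vector)
```

Under this assumption, both transforms divide each vector by a single constant (its sum or its Euclidean norm), so the ratios between items within one vector are preserved; only the overall scale differs.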

### 1 Deep learning - effects of preprocessing

1 answer, 41 views deep-learning data-preprocessing

I am new to the field of deep learning, and I was wondering whether there exist any theorems/laws that govern how various preprocessing techniques affect the learning process. I saw in some models,...

### 2 In this classification dataset preprocessing, should outliers be removed before/after reducing dimensionality?

I have a classification dataset with 148 input (independent) features, most of which are expected beforehand to be irrelevant. So, at the moment, I am using feature selection methods to discard the ...

### Missing value imputation for categorical variables using median in caret R

I want to know how caret's preProcess() in R handles missing categorical values under median imputation. I think kNN imputation does dummy coding for categorical values; how about median imputation? ...

### Feature scaling and when to use which

I am looking into running regression on a multivariate data set. I am looking into different ways to scale my data: standardization, L2 and L1 normalizations. In what case would you use which method? ...