**157 data-preprocessing questions.**

So I have a set of Tweets with a few columns, such as Date, the Tweet itself, and a few more, but I want to use 2 columns to build my model (Sentiment & Stock Price). Sentiment analysis is performed ...

I have an unbalanced data set that, for the purpose of training, I wish to balance. What is the better practice: first balancing the training data and then feature scaling/normalizing, or using the mean and ...

I have a time series of integer values $x \geqslant 0$. I would like to model it using, say, ARIMA, or Holt-Winters. How do I properly preprocess it for the task? I tried log-transform of $x' = x + ...

I am trying to perform binary prediction in a problem where the measurements come from objects that are spatially ordered as a grid and there is a physical meaning to the neighborhood on the grid.
I ...

Say we have a dataset with both continuous and discrete features; for example in the classic House Sales in King County on Kaggle, there are features such as ...

I have attached the boxplot of a variable called Fare (of a journey). This is a continuous variable that has outliers. According to some articles on outliers, I learned that any data point that is ...

I have some data set and need to use a few classification methods to make prediction. I first need to pre-process the data set.
France is administratively divided in regions (13), and regions are ...

I am now trying to evaluate my model with cross validation.
My dataset contains some numeric and nominal attributes.
Here, I carry out the following data preprocessing tasks:
A. Normalization: Min-...

I am working with a large amount of firm data where every variable is highly skewed as there are a large number of extremely small firms and a small number of huge ones.
I am interested in defining ...

Is it possible that the accuracy of Naive Bayes remains the same even after applying standardisation? I have applied 2 standardisation techniques:
Min-max scaling (which squishes the range to 0-1 ...

I am working on classifying a high-dimensional binary sparse dataset. Is it a good idea to measure the distance (via Hamming, Jaccard, etc.) of every negative class observation to every positive one ...

I've studied many questions and answers on the theme of nested cross-validation. I understand why we need it and how I can, after that part, find the optimal hyperparameters and any other things I'm ...

I want to use the baselines code from OpenAI to apply to a power trading setting where I trade energy on a market. My observation space includes several kinds of data, which is why I originally used ...

I have the dataset from here which contains the following features: 'Index', 'Arrival_Time', 'Creation_Time', 'x', 'y', 'z', 'User', 'Model', 'Device', 'gt'. A ...

My data is a multivariate time series of both numeric and categorical data. Like xit = [283, 43, 56, 'Blue', 'Choice A'] for each ID i and time step t. I'm trying to perform classification by feeding ...

I want to impute missing values of a dataset for machine learning (knn imputation). Is it better to scale and center the data before the imputation or afterwards?
Since the scaling and centering ...

I am training a neural network for audio classification. My inputs are "1-channel images" of size 60x130x1.
Surprisingly, I always get better accuracy when training the model with the original data, ...

I have raw data that has about 20 columns (20 features). Ten of them are continuous data and 10 of them are categorical. Some of the categorical data can have like 50 different values (U.S. States). ...

Bottom line up front: is there any reason not to center and scale continuous variables prior to model fitting for the sake of conducting model comparison?
I'm conducting a model comparison on a large ...

I have a short question regarding pre-processing and normalization of multivariate time-series data which is used for 1 step ahead forecasting employing different neural network architectures.
More ...

Background
I run a website that, among other things, crowd-sources data for an online video game's economy. Prices are shared/the same among all players, but because the game developer does not ...

I am currently working on a dataset containing feature vector words. The feature vector consists of ordinal as well as binary data types; the majority of them are binary, e.g. (F,T,F,T,T,36).
How do ...

I preprocessed my data by calculating z-scores of each feature, and trained a Nu-SVR model for a regression.
While preprocessing, I expanded some features' scale up to 10~30 fold by mistake,
and soon ...

I'm attempting to set up a One-class SVM for detecting anomalous DNS traffic based on a training set of normal 'clean' data. I've got a pretty strong grasp on the SVM itself and how to set it up but I'...

This is more of a general question and not ml-algorithm specific, are there any algorithms/tools/papers on the topic of 'selecting' training-data-entries to maximise the accuracy/quality of ...

I've been reading a bunch of posts that advise people to not include test data when preprocessing. So I've proceeded by first setting aside a test dataset to be used to assess how well my classifier ...
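The advice referred to in this excerpt (keep test data out of preprocessing) can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the asker's actual pipeline; the helper names `fit_standardizer` and `standardize` and the toy values are assumptions made up for the example.

```python
from statistics import mean, pstdev

def fit_standardizer(train_values):
    """Estimate centering/scaling statistics from the training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for a constant feature
    return mu, sigma

def standardize(values, mu, sigma):
    """Apply previously fitted statistics to any split (train or test)."""
    return [(v - mu) / sigma for v in values]

# Hypothetical toy feature, already split into train and held-out test.
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mu, sigma = fit_standardizer(train)             # the test data is never seen here
train_scaled = standardize(train, mu, sigma)
test_scaled = standardize(test, mu, sigma)      # reuses the training statistics
```

The point of the split between fitting and applying is that the held-out set only ever receives statistics computed on the training set, so it cannot leak into the preprocessing step.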

I just have a basic question about what 'best practice' generally is, in such a situation:
Suppose I have two finite time series of equal length $\{x(t)\}_{t \in I}$ and $\{y(t)\}_{t \in I}$ and say ...

I was trying to replicate the results of this paper. The paper suggests pre-processing the data in two parts. The paper proposes a hybrid technique for short-term load forecasting.
1) They have used ...

I am using a good volume of time series data that spans over two months [November and December 2015] containing time-stamp observations. A total of about 6 million samples. I use the portion of clean ...

If I encode my data in 2-grams, that's (26+26+10)^2~3800 possible pairs;
if I use 3-grams that's ~200,000 possible triplets. We can reduce this number by using only lower case, but basic combinatorics ...
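The counts quoted in this excerpt can be checked with a few lines of arithmetic. The sketch below assumes the alphabet implied by the excerpt (26 lowercase letters, 26 uppercase letters, 10 digits); the function name `ngram_vocab_size` is made up for illustration.

```python
# Alphabet assumed from the excerpt: 26 lowercase + 26 uppercase + 10 digits.
ALPHABET_SIZE = 26 + 26 + 10

def ngram_vocab_size(alphabet_size: int, n: int) -> int:
    """Number of distinct character n-grams over an alphabet of the given size."""
    return alphabet_size ** n

bigrams = ngram_vocab_size(ALPHABET_SIZE, 2)          # 62^2 = 3844
trigrams = ngram_vocab_size(ALPHABET_SIZE, 3)         # 62^3 = 238328
lowercase_trigrams = ngram_vocab_size(26 + 10, 3)     # 36^3 = 46656 (lower case + digits)
```

Dropping upper case shrinks the 3-gram space by roughly a factor of five, but the count still grows exponentially in n.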

Should the data be whitened in the pre-processing stage when applying the One-Class SVM method to detect outliers?
Whitening makes it so that the variance in each dimension is the same (if I have understood ...

I'm working on a classification problem whose features are very noisy. I have a table with the 'official' feature levels, but the actual data loosely resemble them. For example, to represent a value ...

Given some whitening transform, we change some vectors $\textbf{x}$, where features are correlated, into some vector $\textbf{y}$, where components are uncorrelated. Then we run some learning ...

Suppose I have a feature "Pool Size" whose levels are: Big, Medium, Small.
Surely if a house doesn't have a pool it certainly won't have a value for "Pool Size".
Q1-For this particular instance/...

I'm working with a dataset containing crimes data from Chicago. There's a lot of geographical data, and I'm looking for advice on pre-processing.
We have qualitative variables represented by integers,...

I've split my data and performed pre-processing. I ran some basic classifiers on it and got accuracies within 70-80%, which to me seems fairly low. One thing I didn't do was balance my classes before ...

I have a dataset that I will use for training a logistic regression model with regularization, where one of the features can sometimes take on extreme values. The feature describes a ratio with most ...

Having read about many cases of "standardization", I find that some opinions conflict, e.g.
some cases will add lag features and some of these features are
created by other original features and it ...

I've split my data into three sets before doing any pre-processing; training, validation and testing. I thought that any pre-processing tasks have to take place after splitting the data. However, some ...

I have a data-set with a decent number of attributes, half of them nominal. I used a binary vectorizer to convert the nominal attributes to numerical, but now there are far too many of them. I'm not ...

I have two gene expression data sets from two different labs. One contains about 2000 samples, the other has about 100 samples. I believe they got the data using different biotechnology methods.
Now ...

From what I understand from previously answered questions, you're meant to do your pre-processing on each set after splitting your data into training and test sets. But I'm not sure where the ...

I'm working on anomaly detection in CTU-13 dataset. Records are labeled and there are a few categorical features with many categories (for example one of the features "State" has over 250 possible, ...

I have daily time series financial data. I want to apply machine learning techniques to predict expected returns. To do this, I have first transformed the data so that I could take into account time ...

I am working on a dataset in which a variable has the following levels:
Levels: 0 1 2 3 4 5 8
Frequency: 608 209 28 16 18 5 7
The target variable is binary. ...

For example, say I have an input vector [0, 10, 0, 10, 20] representing 0 of item 1, 10 of item 2, etc.
If I want to train on this data, is there some intuitive ...

I am new to the field of deep learning, and I was wondering whether there exist any theorems/laws that govern how various preprocessing techniques affect the learning process.
I saw in some models,...

I have a classification dataset with 148 input (independent) features, most of which are expected beforehand to be irrelevant. So, at the moment, I am using feature selection methods to discard the ...

I want to know how caret Preprocess() in R handles missing categorical values under median imputation. I think Knn imputation does dummy coding for categorical values; how about median imputation? ...

I am looking into running regression on a multivariate data set. I am looking into different ways to scale my data: standardization, L2 and L1 normalizations.
In what case would you use which method? ...

- machine-learning
- classification
- data-transformation
- neural-networks
- normalization
- categorical-data
- dataset
- regression
- time-series
- svm
- feature-selection
- r
- pca
- deep-learning
- missing-data
- scikit-learn
- standardization
- python
- outliers
- data-imputation
- caret
- correlation
- cross-validation
- unbalanced-classes
- centering
