**166 data-preprocessing questions.**

This question is aimed mainly at data scientists and data engineers with applied experience, although anyone's two cents are appreciated.
Intro:
I have a production-ready model that has been ...

I have a DNN with a binary output layer (positive and negative). The dataset has a very low number of positives, so training the model is difficult.
Currently I plan on using RBMs or Naive ...

I have a big data set to analyze. It has 20 columns corresponding to different parameters, and >20k rows corresponding to observations at different time points. I would like to filter out data ...

I have a data set and need to use a few classification methods to make predictions. I first need to pre-process the data set.
France is administratively divided into regions (13), and regions are ...

This is more of a general question, not specific to any ML algorithm: are there any algorithms, tools, or papers on the topic of 'selecting' training-data entries to maximise the accuracy/quality of ...

I'm working on a project where I have a dataset for a dynamical system (pendulum) containing a trajectory, energy cost and corresponding control actions (See below). I'm using a generative adversarial ...

I have searched for this question, but no answer covered it comprehensively. To my knowledge, linear regression and most clustering algorithms naturally assume that the data need to be in ...

I have been reading about the possible tricks to use for improving the performance of Gradient Descent in neural networks. I understand that there are two common methods:
1. Feature Scaling: Scale ...

I am now trying to evaluate my model with cross validation.
My dataset contains some numeric and nominal attributes.
Here, I carry out the following data preprocessing tasks:
A. Normalization: Min-...

I have a set of tweets with a few columns, such as Date, the Tweet itself, and a few more, but I want to use two columns to build my model (Sentiment and Stock Price). Sentiment analysis is performed ...

I have an unbalanced data set that I wish to balance for the purpose of training. Which is the better practice: first balancing the training data and then feature scaling/normalizing, or using the mean and ...

In this Machine Learning Mastery post we read that, in order to predict time series using an LSTM network, it is good to make the data stationary first and then scale it to the interval $(-1, 1)$.
In ...
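For readers skimming this excerpt, the two steps it names can be sketched as follows. This is only an illustration under stated assumptions: first-order differencing stands in for "making the data stationary", the helper names `difference` and `scale_to_range` are hypothetical, and min-max scaling maps onto the closed interval [-1, 1] rather than the open one. In a real forecast the minimum and maximum would presumably be taken from the training portion only.

```python
def difference(series):
    """First-order differencing: x[t] - x[t-1] (assumed stationarizing step)."""
    return [b - a for a, b in zip(series, series[1:])]


def scale_to_range(values, lo=-1.0, hi=1.0):
    """Min-max scale values linearly so min -> lo and max -> hi."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


# Made-up toy series, not data from the question.
series = [10.0, 12.0, 15.0, 14.0, 18.0]
diffs = difference(series)      # one element shorter than the input
scaled = scale_to_range(diffs)  # smallest difference -> -1, largest -> 1
print(diffs)
print(scaled)
```

Both steps are invertible as long as the first value of the series and the min/max of the differences are stored, which is what allows predictions to be mapped back to the original scale.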

My data was shuffled randomly first, then I applied a five-crop data augmentation. Now my batch went from [8, 3, 256, 256] to ...

I have to write a system that has several data processing steps, each varying in time complexity. I intend to make use of buffer queues and multithreading for each preprocessing step. I have limited ...

I am training a ConvNet to detect different types of stripes in my images. As I am working on astronomical images, my pixel values are flux densities and therefore represent ground truth data.
When I ...

It might be a beginner question, but I'm not sure how to normalize my data.
Let's suppose I have an NxM matrix with N samples of M dimensions each. If I want to normalize my data I can do it in two ...

I'm trying to build a neural network to fit an approximating function to a data set.
The data set consists of (input, output) pairs where the input features are discrete numbers [0,1,2,...,1027].
...

I am fairly new to machine learning and especially the math (stats) behind it. I have a question asked here
I have not only tried the methods suggested in that question, but also with Orange gui ...

I have got hourly temperature data from 2012 to 2016 as follows:
...

I am wondering if it is advisable to preprocess the target for Survival Analysis, regression. Let's say the target is a time-to-event, for example the duration in days for process tasks.
I plot a ...

I have a time series of integer values $x \geqslant 0$. I would like to model it using, say, ARIMA, or Holt-Winters. How do I properly preprocess it for the task? I tried log-transform of $x' = x + ...

I have a dataset:
X Y
123 321
42 24
10 01
As you can see, Y is just the reverse of the input number, nothing special. The issue ...

I'm having a tough time understanding vectors, coming from a CS background. I am reading some books and websites, but the material is still a bit too low-level for me; I need a bit more ...

I am trying to perform binary prediction in a problem where the measurements come from objects that are spatially ordered as a grid and there is a physical meaning to the neighborhood on the grid.
I ...

I have attached the boxplot of a variable called Fare (of a journey). This is a continuous variable which has outliers. According to some articles on outliers, I learned that any data point that is ...

I am working with a large amount of firm data where every variable is highly skewed as there are a large number of extremely small firms and a small number of huge ones.
I am interested in defining ...

Is it possible for the accuracy of Naive Bayes to remain the same even after applying standardisation? I have applied two standardisation techniques:
Min-max scaling (which squishes the range to 0-1 ...
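A small experiment along these lines can be sketched without any library. The code below is a minimal, hand-rolled Gaussian Naive Bayes (an assumption: the excerpt does not say which Naive Bayes variant is used), with hypothetical helper names and made-up data. For the Gaussian variant without variance smoothing, a per-feature linear rescaling shifts every class's log-likelihood by the same constant, so the predicted labels should not change.

```python
import math
from statistics import mean, pvariance


def fit_gaussian_nb(X, y):
    """Return {class: (log_prior, [(mean, variance) per feature])}."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        stats = [(mean(col), pvariance(col)) for col in zip(*rows)]
        model[c] = (math.log(len(rows) / len(X)), stats)
    return model


def predict(model, x):
    """Pick the class with the highest log prior plus Gaussian log-likelihood."""
    def log_posterior(c):
        log_prior, stats = model[c]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, (m, v) in zip(x, stats)
        )
    return max(model, key=log_posterior)


def min_max_scale(X):
    """Rescale each column to [0, 1]."""
    cols = list(zip(*X))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) for v, l, h in zip(row, lo, hi)] for row in X]


# Made-up toy data with two features on very different scales.
X = [[1.0, 200.0], [1.5, 180.0], [2.0, 220.0],
     [6.0, 900.0], [7.0, 950.0], [6.5, 1000.0]]
y = [0, 0, 0, 1, 1, 1]

raw_model = fit_gaussian_nb(X, y)
raw_preds = [predict(raw_model, x) for x in X]

X_scaled = min_max_scale(X)
scaled_model = fit_gaussian_nb(X_scaled, y)
scaled_preds = [predict(scaled_model, x) for x in X_scaled]

print(raw_preds == scaled_preds)
```

Library implementations may add a small smoothing term to the variances, in which case the invariance holds only approximately.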

I am working on classifying a high-dimensional, binary, sparse dataset. Is it a good idea to measure the distance (via Hamming, Jaccard, etc.) of every negative-class observation to every positive one ...

I've studied many questions and answers on the theme of nested cross-validation. I understand why we need it and how I can, after that part, find the optimal hyperparameters and any other things I'm ...

I want to use the baselines code from OpenAI and apply it to a power trading setting where I trade energy on a market. My observation space includes several kinds of data, which is why I originally used ...

I have the dataset from here which contains the following features: 'Index', 'Arrival_Time', 'Creation_Time', 'x', 'y', 'z', 'User', 'Model', 'Device', 'gt'. A ...

I want to impute missing values of a dataset for machine learning (knn imputation). Is it better to scale and center the data before the imputation or afterwards?
Since the scaling and centering ...
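Why the order matters can be illustrated with a toy case. The sketch below is not a full kNN imputer: it uses k = 1, made-up rows, and hypothetical column and helper names. It only shows that the neighbour chosen to fill a missing value can change once the features used for the distance are standardized.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical complete rows: (height_m, income, z); z is the column to impute.
complete = [
    (1.60, 50000.0, 1.0),
    (1.90, 50500.0, 9.0),
    (1.61, 52000.0, 1.2),
]
query = (1.90, 52000.0)  # z is missing for this row


def impute_1nn(rows, q):
    """Copy z from the row whose first two features are closest to q."""
    nearest = min(rows, key=lambda r: dist(r[:2], q))
    return nearest[2]


def standardize_like(rows, q):
    """Z-score the two distance features using statistics of the complete rows."""
    cols = list(zip(*[r[:2] for r in rows]))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) for c in cols]

    def z(v):
        return tuple((x - m) / s for x, m, s in zip(v, mus, sigmas))

    return [z(r[:2]) + (r[2],) for r in rows], z(q)


unscaled_value = impute_1nn(complete, query)       # income dominates the distance
scaled_rows, scaled_query = standardize_like(complete, query)
scaled_value = impute_1nn(scaled_rows, scaled_query)
print(unscaled_value, scaled_value)
```

On the raw data the income column dominates the Euclidean distance, so the row with the matching income is chosen; after standardizing, height carries comparable weight and a different row is chosen.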

I am training a neural network for audio classification. My inputs are "1-channel images" of size 60x130x1.
Surprisingly, I always get better accuracy when training the model with the original data, ...

I have raw data that has about 20 columns (20 features). Ten of them are continuous data and 10 of them are categorical. Some of the categorical data can have like 50 different values (U.S. States). ...

Bottom line up front: is there any reason not to center and scale continuous variables prior to model fitting for the sake of conducting model comparison?
I'm conducting a model comparison on a large ...

I have a short question regarding pre-processing and normalization of multivariate time-series data which is used for 1 step ahead forecasting employing different neural network architectures.
More ...

Background
I run a website that, among other things, crowd-sources data for an online video game's economy. Prices are shared/the same among all players, but because the game developer does not ...

I am currently working on a dataset containing feature vector words. The feature vector consists of ordinal as well as binary data types; the majority of them are binary, e.g. (F,T,F,T,T,36).
How do ...

I preprocessed my data by calculating z-scores of each feature, and trained a Nu-SVR model for a regression.
While preprocessing, I expanded some features' scale up to 10~30 fold by mistake,
and soon ...

I'm attempting to set up a One-class SVM for detecting anomalous DNS traffic based on a training set of normal 'clean' data. I've got a pretty strong grasp on the SVM itself and how to set it up but I'...

I've been reading a bunch of posts that advise people to not include test data when preprocessing. So I've proceeded by first setting aside a test dataset to be used to assess how well my classifier ...
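The advice mentioned in this excerpt is usually taken to mean that preprocessing statistics are estimated on the training split and then reused unchanged on the test split. A minimal sketch for z-scoring a single column, with made-up numbers and hypothetical helper names:

```python
from statistics import mean, pstdev


def fit_standardizer(train_column):
    """Estimate mean and standard deviation from training data only."""
    return mean(train_column), pstdev(train_column)


def standardize(column, mu, sigma):
    """Apply previously fitted statistics to any split."""
    return [(v - mu) / sigma for v in column]


train = [2.0, 4.0, 6.0, 8.0]
test = [10.0, 0.0]  # held out; never used to compute mu or sigma

mu, sigma = fit_standardizer(train)
train_z = standardize(train, mu, sigma)
test_z = standardize(test, mu, sigma)
print(mu, sigma)
print(test_z)
```

The standardized test values may fall outside the range seen in training, which is expected: the test split is treated exactly as unseen data would be.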

I just have a basic question about what 'best practice' generally is, in such a situation:
Suppose I have two finite time series of equal length $\{x(t)\}_{t \in I}$ and $\{y(t)\}_{t \in I}$ and say ...

I was trying to replicate the results of this paper, which proposes a hybrid technique for short-term load forecasting. The paper suggests pre-processing the data in two parts.
1) They have used ...

I am using a good volume of time series data that spans over two months [November and December 2015] containing time-stamp observations. A total of about 6 million samples. I use the portion of clean ...

If I encode my data in 2-grams, that's (26+26+10)^2 ~ 3800 possible pairs.
If I use 3-grams, that's ~200,000 possible triplets. We can reduce this number by using only lower case, but basic combinatorics ...
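The counts quoted above follow from raising the alphabet size to the n-gram length. A quick check, assuming the alphabet is exactly the 26 upper-case letters, 26 lower-case letters, and 10 digits (the function name is hypothetical):

```python
def ngram_vocabulary_size(alphabet_size, n):
    """Number of distinct character n-grams over an alphabet."""
    return alphabet_size ** n


mixed_case = 26 + 26 + 10   # upper case, lower case, digits
lower_only = 26 + 10        # lower case and digits

sizes = {
    (name, n): ngram_vocabulary_size(size, n)
    for name, size in (("mixed", mixed_case), ("lower", lower_only))
    for n in (2, 3)
}
for key, value in sizes.items():
    print(key, value)
```

Dropping upper case shrinks the 3-gram vocabulary from 238,328 to 46,656, but the growth with n remains exponential.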

Should the data be whitened in the pre-processing stage when applying the One-Class SVM method to detect outliers?
Whitening makes it so that the variance in each dimension is the same (if I have understood ...

I'm working on a classification problem whose features are very noisy. I have a table with the 'official' feature levels, but the actual data loosely resemble them. For example, to represent a value ...

Given some whitening transform, we change some vectors $\textbf{x}$, where features are correlated, into some vector $\textbf{y}$, where components are uncorrelated. Then we run some learning ...

Suppose I have a feature "Pool Size" whose levels are: Big, Medium, Small.
Surely if a house doesn't have a pool it certainly won't have a value for "Pool Size".
Q1-For this particular instance/...

I'm working with a dataset containing crime data from Chicago. There's a lot of geographical data, and I'm looking for advice on pre-processing.
We have qualitative variables represented by integers,...

- machine-learning
- classification
- data-transformation
- neural-networks
- normalization
- categorical-data
- dataset
- regression
- feature-selection
- r
- time-series
- python
- svm
- standardization
- pca
- missing-data
- scikit-learn
- outliers
- deep-learning
- data-imputation
- caret
- correlation
- mathematical-statistics
- cross-validation
- unbalanced-classes
