# data-preprocessing questions

174 data-preprocessing questions.

### Reshaping data frame to long list of key:values [migrated]

0 answers, 12 views r data-preprocessing
I have a dataset row data in the following format: ...

### Reverse caret preProcess()

I did the following steps in my modeling using R: 1) applied the preProcess() function from the caret package and then encoded the data, 2) used SMOTE to balance the data, 3) applied multiple linear regression for ...

### Preprocessing using SentiWordNet

I want to perform sentiment analysis with SentiWordNet on tweets. The analysis should be at the document level, i.e. I want to classify each tweet as positive, negative or neutral. However, the ...

### Correlated variables - classification

I have a data set and need to use a few classification methods to make predictions. I first need to pre-process the data set. France is administratively divided into 13 regions, and regions are ...

### Single words as input in a feed forward neural network?

Is there a way I can use single words, e.g. names, as input to a feed-forward neural network? It has to be a feed-forward NN, so I guess I have to implement some sort of pre-processing, like ...

### 1 Cross Validation with Preprocessing (Normalization, Discretization, Feature Selection)

I am now trying to evaluate my model with cross validation. My dataset contains some numeric and nominal attributes. Here, I carry out the following data preprocessing tasks: A. Normalization: Min-...

### 6 How to prepare data for input to a sparse categorical cross entropy multiclassification model [closed]

So I have a set of tweets with a few columns such as Date and the Tweet itself and a few more, but I want to use 2 columns to build my model (Sentiment & Stock Price). Sentiment analysis is performed ...

### 1 Pre-processing categorical data in R

0 answers, 74 views r data-preprocessing
Example of data set ...

### 1 Should I use other scaling methods for pre-processing the data rather than normalizing or MinMaxScaling?

I have time series data for a weather parameter (daily volume of inflow for a river in million cubic meters, MCM) as follows: I want to scale this data and feed it to a ...

### Balancing dataset and normalizing features: what comes first?

I have an unbalanced data set that I wish to balance for the purpose of training. Which is better practice: first balancing the training data and then scaling/normalizing the features, or using the mean and ...

### 1 Handling daily time series data for better accuracy

I have daily observations of call volume from 28-01-2017 to 31-08-2018, a little over one and a half years. Call volume is lowest on Sundays and highest on Mondays, showing a weekly pattern. ...

### Improve Accuracy over Image dataset in Convolutional Neural Network

I am a newbie to Convolutional Neural Networks. For some dataset I trained my ConvNet model and achieved some accuracy. Now I have increased the number of filters, added one or two ConvNet layers, and saw some ...

### 1 Handle Categorical Variables in Machine Learning in Python [closed]

I have $4$ variables in the data set, each with more than $50$ levels. I want to include all these variables in my predictive model. How should I handle these categorical variables? If I do ...

### 5 Rule of thumb for using logarithmic scale

When I am given a variable, I usually decide whether to take its logarithm based on gut feeling. Usually I base it on its distribution: if it has a long tail (like salaries, GDP, ...) I use logarithms. ...

### How to pre-process features from different domains for Machine Learning models [duplicate]

Even though my question applies to all kinds of models, I am asking it in the scope of SVM for now. Assume I have 3 sentences in my training set and 2 sentences in my test set. I would like to ...

### What is the best way to remove outliers in the task of forecasting demand?

I have a dataset with over 2500 product IDs. For each ID I need to remove outliers. I removed them using the condition that everything that lies beyond 1.5 * IQR should be deleted, but it seems that ...
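The rule mentioned in this excerpt can be sketched as follows. This is a minimal illustration, not the asker's code: it assumes the usual Tukey fences (values outside $[Q_1 - 1.5\,\mathrm{IQR},\ Q_3 + 1.5\,\mathrm{IQR}]$ are dropped), and the `demand` dictionary and the `iqr_filter` helper are hypothetical names introduced only for the example.

```python
from statistics import quantiles

def iqr_filter(values, k=1.5):
    """Keep values inside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    if len(values) < 4:
        return list(values)  # too few points to estimate quartiles sensibly
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical demand history keyed by product ID (made-up numbers).
demand = {
    "A": [10, 12, 11, 13, 12, 95],
    "B": [3, 4, 3, 5, 4, 4],
}

# Apply the rule separately to each product, since scales differ per ID.
cleaned = {pid: iqr_filter(vals) for pid, vals in demand.items()}
```

Filtering per ID rather than over the pooled data matters here because a value that is extreme for one product can be ordinary for another.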

### 1 How to approach preprocessing a large number of features for machine learning?

I am used to applying supervised machine learning to at most a few dozen "normal", natural features, like the human-interpretable ones in the Boston House Prices table. I usually try to understand each of them, think ...

### 1 Why don't we normalize the images?

I was watching the video from this Stanford course on convolutional neural nets where the professor says (at 28:59) 'we do zero-mean the pixel values in the image but we do not normalize the pixel values ...

### Is it beneficial and practical to perform the same moving average, 10 times (in pandas, df.rolling operation) on an already rolled df?

I have precipitation data that is very sharp in some places and, in my case, not predictable by the LSTM model I'm currently working on. Here is ...

### Statistics: Eliminating the effect of one parameter

A quick statistics question. Consider a conveyor belt inside a black box. A test object is pulled through the black box at different (known) speeds whilst gathering sensor data (temperature ...

### Data Scaling: is multi-dimension scaling equivalent to uni-dimension scaling for same range features?

I am given an $n \times m$ matrix $\mathbf R$ where $n$ is the number of users and $m$ is the number of items. This matrix is usually known as the rating matrix in the recommender systems domain. ...

### 2 What preprocessing techniques work well for autoencoding audio?

I am wondering what preprocessing techniques work well for autoencoding audio data? Specifically I have a dataset of ~0.5 second audio samples of people pronouncing digits 0-9 (think an audio version ...

### 2 “We have to apply feature scaling on test set using scaling parameters from train set.” Is this statement true? If yes, why?

I understand that in standard data cleaning and pre-processing pipelines, we have to make sure that the information from the test set (or what would be the test set after splitting) does not leak into ...
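The statement in this question's title can be sketched in a few lines. This is a minimal illustration using standardization with only the Python standard library; the data and the helper names `fit_standardizer` and `transform` are hypothetical and introduced only for the example.

```python
from statistics import mean, pstdev

def fit_standardizer(train):
    """Estimate scaling parameters from the training data only."""
    mu = mean(train)
    sigma = pstdev(train) or 1.0  # guard against a constant feature
    return mu, sigma

def transform(values, params):
    """Apply previously fitted parameters; nothing is re-estimated here."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

# Made-up one-dimensional feature, already split into train and test.
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 11.0]

params = fit_standardizer(train)        # the test set is never seen here
train_scaled = transform(train, params)
test_scaled = transform(test, params)   # reuses the train mean and std
```

Because the test values are transformed with the training mean and standard deviation, they may fall outside the range seen in training, which is expected and is the same situation the model faces on new data.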

### 1 Does it make sense to preprocess (normalise or standardise) this data for GAN?

I'm working on a project where I have a dataset for a dynamical system (pendulum) containing a trajectory, energy cost and corresponding control actions (See below). I'm using a generative adversarial ...

### List of machine learning classifiers that naturally assume data in normal distribution

I have searched for this question, but no answer was really comprehensive. To my knowledge, linear regression and most clustering algorithms naturally have the assumption that data need to be in ...

### Improving the performance of gradient descent during data preprocessing

I have been reading about the possible tricks to use for improving the performance of Gradient Descent in neural networks. I understand that there are two common methods: 1. Feature Scaling: Scale ...

### 1 A convenient way to transform a time series into a stationary one

In this Machine Learning Mastery post we read that in order to predict time series using an LSTM network, it is good to make the data stationary first and then scale it to the interval $(-1, 1)$. In ...

### Should you reshuffle your dataset after you use five or ten crop data augmentation in general machine learning?

My data was shuffled randomly first then I applied a five crop data augmentation. Now my batch went from [8, 3, 256, 256] to ...

### Applicability of Queueing models for data pipelining?

I have to write a system that has several data processing steps, each varying in time complexity. I intend to make use of buffer queues and multithreading for each preprocessing step. I have limited ...

### 3 Can a ConvNet see patterns that a human cannot?

I am training a ConvNet to detect different types of stripes in my images. As I am working on astronomical images, my pixel values are flux densities and therefore represent ground truth data. When I ...

### 2 Should I normalize featurewise or samplewise

It might be a beginner question, but I'm not sure how to normalize my data. Let's suppose I have an NxM matrix with N samples of M dimensions each. If I want to normalize my data I can do it in two ...
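The two options in this question can be contrasted on a tiny matrix. This is a minimal sketch, assuming "normalize" means standardizing to zero mean and unit variance; the matrix `X` and the helper names are hypothetical and introduced only for the example.

```python
from statistics import mean, pstdev

def standardize(vec):
    """Zero mean, unit (population) standard deviation."""
    mu = mean(vec)
    sigma = pstdev(vec) or 1.0  # guard against a constant vector
    return [(v - mu) / sigma for v in vec]

def samplewise(X):
    """Normalize each row (one sample) on its own."""
    return [standardize(row) for row in X]

def featurewise(X):
    """Normalize each column (one feature) across all samples."""
    cols = [standardize(list(col)) for col in zip(*X)]
    return [list(row) for row in zip(*cols)]

# N = 3 samples, M = 2 features on very different scales (made-up numbers).
X = [[1.0, 100.0],
     [2.0, 200.0],
     [3.0, 300.0]]

by_feature = featurewise(X)  # each column is rescaled, samples stay distinct
by_sample = samplewise(X)    # every row collapses to the same pattern
```

On this particular matrix the samplewise version maps every row to the same pair of values, so the differences between samples are lost, while the featurewise version only removes the difference in scale between the two columns.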

### NN works well on preprocessed data, but results are poor after de-normalizing - predicting non-normalized prediction values using NN?

1 answer, 26 views neural-networks data-preprocessing
So I'm trying to build a neural network to fit an approximating function to a data set. The data set consists of (input, output) pairs where the input features are discrete numbers [0,1,2,...,1027]. ...

### How do I look at my data and decide the best preprocessing or model tuning steps to take?

I am fairly new to machine learning and especially the math (stats) behind it. I have a question asked here. I have not only tried the methods suggested in that question, but also with the Orange GUI ...

### 1 Missing data imputation in time series in R

I have got hourly temperature data from 2012 to 2016 as follows: ...

### 1 Preprocessing Target for Survival Analysis

I am wondering if it is advisable to preprocess the target for Survival Analysis, regression. Let's say the target is a time-to-event, for example the duration in days for process tasks. I plot a ...

### 4 Modelling nonnegative integer time series

1 answer, 165 views time-series arima data-preprocessing
I have a time series of integer values $x \geqslant 0$. I would like to model it using, say, ARIMA, or Holt-Winters. How do I properly preprocess it for the task? I tried a log-transform of $x' = x +$ ...

### Best preprocessing method for my regression task

I have a dataset:

| X | Y |
| --- | --- |
| 123 | 321 |
| 42 | 24 |
| 10 | 01 |

As you can see, Y is just the reverse of the input number, nothing special. The issue ...

### 1 What is the L Norm used for, in layman's terms, in regard to a vector?

So I'm having a tough time trying to understand vectors, coming from a CS background. I am reading some books and reading through websites, but it is still a bit too low-level for me, I need a bit more ...

### How to handle certain types of anomalies in the input when using CNN

I am trying to perform binary prediction in a problem where the measurements come from objects that are spatially ordered as a grid and there is a physical meaning to the neighborhood on the grid. I ...

### 2 How to deal when you have too many outliers?

I have attached the boxplot of a variable called Fare (of a journey). This is a continuous variable which has outliers. According to some articles on outliers, I learned that any data point that is ...

### 1 Define population without introducing bias [closed]

I am working with a large amount of firm data where every variable is highly skewed as there are a large number of extremely small firms and a small number of huge ones. I am interested in defining ...

### Standardisation in Naive Bayes?

Is it possible that the accuracy of Naive Bayes remains the same even after applying standardisation? I have applied 2 standardisation techniques: Min Max Scaling (which squishes the range from 0-1 ...

### Is it reasonable to exclude negative observation with a big distance to the positive ones?

I am working on classifying a high-dimensional binary sparse dataset. Is it a good idea to measure the distance (via Hamming, Jaccard, etc.) of every negative class observation to every positive one ...

### Preprocessing+hyperparameter selection: nested or nested nested cross validation?

I've studied many questions and answers on the theme of nested cross-validation. I understand why we need it and how I can, after that part, find the optimal hyperparameters and any other things I'm ...

### 1 Using non-normalized data for learning a RL agent using PPO

I want to apply the Baselines code from OpenAI to a power trading setting where I trade energy on a market. My observation space includes several kinds of data, which is why I originally used ...

### How to encode timestamp features into more meaningful features

I have the dataset from here which contains the following features: 'Index', 'Arrival_Time', 'Creation_Time', 'x', 'y', 'z', 'User', 'Model', 'Device', 'gt' A ...

### 4 Imputation of missing data before or after centering and scaling?

I want to impute missing values of a dataset for machine learning (kNN imputation). Is it better to scale and center the data before the imputation or afterwards? Since the scaling and centering ...

### Worse accuracy with input normalization (NNs)

I am training a neural network for audio classification. My inputs are "1-channel images" of size 60x130x1. Surprisingly, I always get better accuracy when training the model with the original data, ...

### 11 Neural Nets: One-hot variable overwhelming continuous?

I have raw data that has about 20 columns (20 features). Ten of them are continuous data and 10 of them are categorical. Some of the categorical data can have like 50 different values (U.S. States). ...

### 2 Downside to scaling and centering?

Bottom line up front: is there any reason not to center and scale continuous variables prior to model fitting for the sake of conducting model comparison? I'm conducting a model comparison on a large ...