# Split a large file into new files with unique file names

kapr0001

I need to split a file into several files with unique names.
I can do it with the sed command, e.g. sed -n '/scaffold135_/w 135-scaf.txt' input file.txt, but that is time-consuming, so I need a smarter, faster way to do it. Below is an input sample (the original file has one million lines):

scaffold1_115,T,N,N,N,N,A,N,N,N,N,N,N,T,N,T,T,N,A,A,N,N,A
scaffold1_123,A,N,N,N,N,G,N,N,N,N,N,N,A,N,A,A,N,G,G,N,N,G
scaffold1_140,C,N,N,N,N,C,N,N,N,N,N,N,C,N,C,C,N,T,C,N,N,C
scaffold2_161,G,N,N,N,N,G,N,C,N,N,C,N,G,N,G,G,N,G,G,C,N,G
scaffold2_162,C,N,N,N,N,C,N,T,N,N,T,N,C,N,C,C,N,C,C,T,N,C
scaffold2_180,C,N,N,N,N,C,N,T,N,N,C,C,C,T,C,C,T,C,C,C,N,C
scaffold2_194,C,N,N,C,N,C,C,C,C,C,C,C,C,C,T,C,C,C,C,C,N,C
scaffold3_195,G,N,N,G,G,C,G,G,G,G,G,G,C,G,C,G,G,C,C,G,N,C
scaffold3_234,T,N,A,T,A,A,T,T,T,A,T,A,A,T,A,A,T,A,A,T,N,A
scaffold101_282,C,T,T,T,C,C,T,C,T,C,C,C,C,T,C,C,T,C,C,C,N,C
scaffold101_371,T,T,T,T,T,C,T,T,T,T,T,T,T,T,T,T,T,T,T,T,N,C
scaffold101_372,T,T,T,T,C,C,T,T,T,T,T,T,T,T,T,T,T,T,T,T,N,C

The lines are unique. I want the lines for each scaffold in a separate file: for example, all lines that start with scaffold1_ go into a file named scaffold1.txt, and so on up to scaffold10156.txt, which contains the lines starting with scaffold10156_.

iruvar 09/21/2015.

You should be able to use redirection with awk:

awk -F'_' '{print > ($1".txt")}' file

If lines sharing the scaffoldn_ prefix are contiguous, you could do the following to avoid breaching the open file handles limit:

awk -F'_' 'NR == 1 || $1 != prev {if (f) close(f); f = $1".txt"; prev = $1}
{print > f}
END {if (f) close(f)}' file
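If the lines for one scaffold might not be contiguous, a possible variation is to append with >> instead of >, so that a file reopened later is not truncated. The following is only a sketch under assumptions: sample.txt and its shortened lines are made up for illustration, and the output files are removed first because >> would otherwise append to leftovers from an earlier run.

```shell
# Hypothetical sample input; the scaffold1 lines are deliberately not contiguous.
cat > sample.txt <<'EOF'
scaffold1_115,T,N
scaffold2_161,G,N
scaffold1_123,A,N
scaffold101_282,C,T
EOF

# Start clean, because >> appends to any output files that already exist.
rm -f scaffold1.txt scaffold2.txt scaffold101.txt

# Keep only one output file open at a time. When the prefix changes,
# close the previous file and switch to the new one; appending means a
# prefix seen again later does not wipe what was already written.
awk -F'_' '
$1 != prev { if (f) close(f); f = $1 ".txt"; prev = $1 }
{ print >> f }
' sample.txt
```

The cost is one close and reopen per change of prefix, so this stays fast when the input is mostly grouped and slows down if prefixes alternate on every line.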