
Split large csv file into smaller ones while keeping the column headers

#1 ·
Hi there,

I found a topic that actually solved my problem of splitting the CSV file into multiple ones, "split a 100000 line csv into 5000 line csv files with DOS batch", but it is already closed, so I decided to open a new one.

The solution given in the previous thread was this, and it actually worked:

=======================================================
@echo off
setlocal ENABLEDELAYEDEXPANSION
REM Edit this value to change the name of the file that needs splitting. Include the extension.
SET BFN=BigFile.csv
REM Edit this value to change the number of lines per file.
SET LPF=5000
REM Edit this value to change the name of each short file. It will be followed by a number indicating where it is in the list.
SET SFN=SplitFile

REM Do not change beyond this line.

SET SFX=%BFN:~-3%

SET /A LineNum=0
SET /A FileNum=1

For /F "delims==" %%l in (%BFN%) Do (
    SET /A LineNum+=1

    echo %%l >> %SFN%!FileNum!.%SFX%

    if !LineNum! EQU !LPF! (
        SET /A LineNum=0
        SET /A FileNum+=1
    )
)
endlocal
Pause
=======================================================

However, after splitting the file, I noticed that only the first split file had the column headers, while the remaining ones did not. Is it possible to add the headers to every split file through batch? If so, what code can I include in the script given above?

Thanks.
 
#2 ·
It can be done.
I'm going to assume your data doesn't have an equals sign anywhere in it, because the code you are using delimits each line on the = sign. That would make all the data after an equals sign disappear from your output. The delimiters should be set to nothing (delims= with nothing after it).

So basically what I would do is hold the very first line of your big file, which is your header record, in a separate variable. Then use an IF statement to see whether your LineNum variable is equal to 1, and when it is, echo that variable to the output file. Of course, you will need a flag so that it doesn't output the header twice in the very first file. (There is a rough sketch at the end of this post.)

Technically, your code does not output 5000 "records" to the first file, because it is counting the header record as a line. Currently the code will put 4999 data records and 1 header record in the first split file, and the split files after that will all have 5000 data records.

So you will need to ask yourself: do you want 5000 data records in each file plus the header record, making each split file 5001 lines, or do you want 4999 data records plus the header record, making each split file 5000 lines? That may require some additional coding as well.
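Along those lines, here is a rough sketch of how the script from post #1 could be adjusted. It is only a sketch, not a drop-in replacement: it assumes the CSV contains no cmd special characters (& ^ | < >), no lines starting with a semicolon (FOR /F's default end-of-line character), and no blank lines you need to keep (FOR /F skips them). It goes with the "5000 data records plus the header" option, so each split file comes out at LPF + 1 lines.

=======================================================
@echo off
setlocal ENABLEDELAYEDEXPANSION
REM Edit these three values: file to split (with extension),
REM data lines per split file, and the base name of each split file.
SET BFN=BigFile.csv
SET LPF=5000
SET SFN=SplitFile

SET SFX=%BFN:~-3%

SET /A LineNum=0
SET /A FileNum=1
SET /A HeaderDone=0

REM "delims=" (nothing after the equals sign) keeps each whole line intact,
REM even if the data itself contains = signs.
FOR /F "usebackq delims=" %%l IN ("%BFN%") DO (
    IF !HeaderDone! EQU 0 (
        REM Very first line of the big file: remember it as the header.
        SET "Header=%%l"
        SET /A HeaderDone=1
    ) ELSE (
        IF !LineNum! EQU 0 (
            REM A new split file is starting: write the saved header first.
            >"%SFN%!FileNum!.%SFX%" echo !Header!
        )
        SET /A LineNum+=1
        REM The redirection goes in front of echo so a line ending in a
        REM digit is not mistaken for a file-handle redirect.
        >>"%SFN%!FileNum!.%SFX%" echo %%l
        IF !LineNum! EQU %LPF% (
            SET /A LineNum=0
            SET /A FileNum+=1
        )
    )
)
endlocal
Pause
=======================================================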
 
#8 ·
I'm sure the first file will have the header; all you need is a batch file that takes all the other files (the names will be related, as far as I can make out from the tool) and inserts the header into each of them.
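Something like this might do it, as a rough sketch; it assumes the split files were produced by the script in post #1 (SplitFile1.csv, SplitFile2.csv, ... in the current folder, with the header already present in the first one) and that the header line contains no cmd special characters.

=======================================================
@echo off
setlocal ENABLEDELAYEDEXPANSION
REM The original big file and the base name used for the split files.
SET BFN=BigFile.csv
SET SFN=SplitFile

REM Read the first line of the big file (the header) into a variable.
SET /P Header=<%BFN%

REM Prepend the header to every split file except the first one,
REM which already has it.
FOR %%F IN (%SFN%*.csv) DO (
    IF /I NOT "%%F"=="%SFN%1.csv" (
        >"%%F.tmp" echo !Header!
        type "%%F" >> "%%F.tmp"
        move /Y "%%F.tmp" "%%F" >nul
    )
)
endlocal
Pause
=======================================================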
 
#10 ·
Thank you all for your responses. However, as much as I would like to use ckphilli's suggestion, I cannot do so because of some licensing issues, even though it is freeware; that is why I was hoping to do it with a batch file.
 
#11 ·
It's not only freeware, it is open source. There is no licensing issue! And because it is a compiled executable, you will find that it runs 10 times faster than a batch file. Batch was never meant to process large text files.

Trust me on this. I do data processing for the largest printing company in the world. This is a no-brainer! There should be no issue using it.
 
#13 ·
I'll see if my company will allow it. Many thanks to all; I will keep this open for now and mark it as solved if the company allows using it. Thanks again :)
It doesn't even require an install. It is a self-contained executable.

Trust me when I say splitting large text files with BATCH is extremely slow. I tried writing several batch files in the past to do certain things and it would take hours to process a million or so records. I just ran that utility and it took probably less than 30 seconds to process 266,000 lines.

PowerShell can do it faster but I haven't gotten around to learning everything about it to do what I need it to do. There is no substitute for a compiled executable.
 
#14 ·
Additionally, writing a macro to process this much data will SMOKE your machine. When I was dealing with this kind of thing, I tried the macro route (before I found this) on a stout machine with a fast processor and lots of memory, and it was horrible.
 
#15 ·
Exactly. I found that out the hard way as well. When you use a file in a FOR loop in a batch file, the whole file gets loaded into memory. I only had 1GB of memory on my system, so any time I encountered a file larger than around 600MB, it would give me a memory error and not process the file.
 