(We have also published a Simple Data Conversion Tutorial.)
Data Conversion is the generic term given to the process of converting computer data between different applications and/or between different computers. Data Conversion usually also involves Media Conversion -- converting the files from one type of tape or disk to another.
Data Conversion is far more complex than this brief article can address, so we have referenced additional detailed articles at the end of this article. This article assumes the files will be exchanged via tape, which is by far the most common method.
The type of tape does not always indicate the physical recording format, and therefore the drive you need. For example, a DLT IV tape is used in DLT 4000, 7000, and 8000 drives, and in Benchmark DLT-1 and VS-80 drives, and they all write different numbers of tracks and densities. Likewise, an 8mm 112M tape could be written in 8200 format without compression, 8200C compressed format, 8500 uncompressed format, or 8505 compressed format. You simply can't tell by the type of tape what the recording format is. This is true of many tapes.
There are hundreds of programs used to write files to tape, and each one does it differently. In nearly all cases more than just the raw data is written to the tape. The tape program, such as a backup program, creates a data structure on the tape -- sort of like a container to aid in storing and retrieving the files -- then places your files within that structure. This data structure is unique to the backup program used, and to retrieve your files from the tape you will need to extract them from that structure using the same program to read the tape as was used to write the tape.
In some cases, such as IBM mainframe tapes, the tape format is dictated by the file type. A file with fixed-length records will dictate a fixed-block (FB) tape, whereas a file with variable-length records will dictate a variable-block (VB) tape. In other cases, such as UNIX tar and PC backup programs, files are written to tape in the same way, regardless of the type of file.
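The difference between the two block layouts can be sketched in a few lines of Python. This is only an illustration, assuming the usual IBM convention in which a variable-length block starts with a 4-byte block descriptor word (BDW) and each record starts with a 4-byte record descriptor word (RDW), each holding a big-endian length that includes the descriptor itself. The function names are hypothetical, and spanned records are not handled.

```python
import struct

def split_fixed_block(block: bytes, lrecl: int) -> list[bytes]:
    """Split an FB block into records of exactly lrecl bytes each."""
    if len(block) % lrecl != 0:
        raise ValueError("block length is not a multiple of the record length")
    return [block[i:i + lrecl] for i in range(0, len(block), lrecl)]

def split_variable_block(block: bytes) -> list[bytes]:
    """Split a VB block into records using the BDW and RDW length fields."""
    (block_len,) = struct.unpack(">H", block[0:2])    # BDW: total block length
    records = []
    pos = 4                                            # skip the 4-byte BDW
    while pos < block_len:
        (rec_len,) = struct.unpack(">H", block[pos:pos + 2])  # RDW: length incl. RDW
        if rec_len < 4:
            raise ValueError("invalid record descriptor word")
        records.append(block[pos + 4:pos + rec_len])   # data follows the 4-byte RDW
        pos += rec_len
    return records

# Demo: build one VB block holding two records of different lengths.
rec1, rec2 = b"ABC", b"HELLO"
body = (struct.pack(">HH", len(rec1) + 4, 0) + rec1 +
        struct.pack(">HH", len(rec2) + 4, 0) + rec2)
vb_block = struct.pack(">HH", len(body) + 4, 0) + body
```

With a fixed-block tape the record length must come from the label or from documentation; with a variable-block tape each record carries its own length.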
Tape programs vary widely, but each platform has some common methods. Here are a few:
IBM Mainframes
IBM Mainframe computers usually write an "IBM Standard Label (SL) tape". This format writes a small file called a "label" before and after each data file. The first label defines the file that follows (its name, type, date, etc.), and the second label repeats that information and also confirms the block count. We have published several articles on IBM Mainframe tapes; see our TechTalk Index.
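As a rough sketch of what reading such a label involves, the Python below decodes one 80-byte label from EBCDIC and pulls out a few fields. The column positions used here (label identifier in columns 1-4, data set name in columns 5-21, block count in columns 55-60) are assumptions based on the commonly documented layout and should be checked against IBM's documentation. The function name and the sample label are hypothetical.

```python
def parse_sl_label(label: bytes) -> dict:
    """Decode one 80-byte IBM standard label and extract a few fields."""
    if len(label) != 80:
        raise ValueError("standard labels are 80 bytes long")
    text = label.decode("cp037")            # labels on IBM SL tapes are EBCDIC
    info = {"label_id": text[0:4]}          # e.g. HDR1 or EOF1
    if info["label_id"] in ("HDR1", "EOF1", "EOV1"):
        info["dataset_name"] = text[4:21].rstrip()
        # Assumed offset: block count in columns 55-60 (zeros in a header label).
        info["block_count"] = int(text[54:60])
    return info

# Demo: a made-up trailer label, built as text and then encoded to EBCDIC.
sample = ("EOF1" + "PATIENT.MASTER".ljust(17) + " " * 33 + "000125" + " " * 20)
sample_label = sample.encode("cp037")
```

Comparing the block count in the trailer label with the number of blocks actually read is a simple way to confirm that a file was read completely.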
ASCII Mainframes
Mainframe computers that operate in ASCII, like CDC, etc., usually write an "ANSI Standard Label (SL) tape". The ANSI SL tape format is very similar to the IBM SL tape format, but both the labels and the data are in ASCII.
IBM AS/400
IBM AS/400 computers running the OS/400 operating system can write IBM SL format, but generally write a "SAV" (Save) format that is proprietary and unique to AS/400 tapes.
DEC VAX VMS and Alpha VMS
DEC (Digital Equipment Corporation / Compaq / HP) VAX VMS and Alpha VMS computers usually write a "Backup" format tape. This format is unique to VMS computers.
UNIX and Linux
UNIX and Linux computers come with a program called TAR (Tape ARchive), for writing tapes. cpio (CoPy In-Out) can also be used to write tapes on UNIX, as can Dump. All three formats are different. There is good interchangeability of TAR and some interchangeability of cpio tapes across UNIX systems.
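Because the TAR format is so widely supported, a TAR archive that has been copied from tape to disk can be inspected on almost any system. The sketch below uses Python's standard tarfile module to list the files in an archive held in memory. The helper names and the demo archive are hypothetical.

```python
import io
import tarfile

def list_tar_members(archive: bytes) -> list[tuple[str, int]]:
    """Return (name, size) for each regular file in an uncompressed tar archive."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        return [(m.name, m.size) for m in tar.getmembers() if m.isfile()]

def make_demo_tar() -> bytes:
    """Build a tiny tar archive in memory so the example is self-contained."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        data = b"hello tape\n"
        info = tarfile.TarInfo(name="notes.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```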
Microsoft Windows
Although Windows systems come with a backup program, most users opt to use a third-party backup program like Arcserve, Veritas Backup Exec, Nova Backup, etc. Each of these programs writes data to tape in different ways, although some are able to read the tape format from competing products.
Apple Macintosh
Like Windows users, Macintosh users also use third-party backup programs like Retrospect.
The "File Type" we are discussing below is the file type on disk, either before it is written to tape, or after it is restored from tape. As noted above, the file type may determine the Tape Format.
What "File type" and "File content" refer to depends on both the operating system and the application that created the file, so it's difficult to make global statements that apply to all situations. The issues are considerably different for mainframes than for PCs, and are different for different kinds of files -- word processing files and database files, for example. In most cases the File Type and File Content are closely related, with overlapping issues and interactions.
File type generally refers to how the file is stored on disk, while File Content refers to what is stored in the file, including how the data is coded. However, in some cases, such as certain database files, the file type and file content are inseparably tied and generally referred to simply as the "file type". Furthermore, when the operating system doesn't support different file types, as is the case with UNIX and Windows, "file type" usually refers to the application file type, such as "an Access file" or "an SQL file", or a generic file type such as "a comma-delimited file".
Clearly the term "file type" is not used consistently, and the meaning varies greatly between operating systems. To a mainframe user, "file type" would mean "indexed" or "sequential", and "fixed length" or "variable length", both of which have very specific and different structures on disk. But for a PC user, "file type" would typically mean, for example, a "comma-delimited" file or an "Access file". This makes it difficult to communicate "file type" unless the context of the discussion is understood, and even more difficult to discuss when converting between disparate operating systems.
Because "File type" has such different meaning between mainframes and PCs, we will discuss mainframe files and PC files separately. Following those descriptions we will briefly discuss converting files between mainframes and PCs. To keep this article brief, we will mainly discuss database files.
With this background information, let's look at file type and file content.
There are some fundamental differences in how computers store files. The operating systems of Mainframe computers, AS/400, DEC VMS, and others "understand" file and record structure, so you can define the type of file -- indexed or sequential for example -- within the OS. You can also store characteristics of the file, such as the record type (fixed length or variable length for example), and file parameters (such as record length) within the OS.
But the UNIX and Windows operating systems don't use such concepts; to them a file is just a stream of bytes with no structure. Those computers rely on the application programs to handle the structure. Converting between these systems then means transferring the concept of "file type" from the OS side to the applications-program side, or vice-versa.
Macintosh computers store the data portion of a file in the "Data Fork", and information about the type of file in the "Resource Fork". When converting from Macintosh, you should read the Resource Fork and use that information to interpret the file, and when converting data to a Macintosh, you must create the proper Resource Fork.
Mainframe Computers and Mid-Range Computers
Mainframe and Mid-Range operating systems define not only the name, date, and size of a file, but the type of file. They define and manage the record structure of the file, and even manage the indexing. These computers have record-management services built into the operating system to handle file I/O on a record basis, including handling the indexing. On these computers you normally read and write whole records with each I/O request.
Personal Computers
Personal computers running Windows, UNIX, or the Macintosh OS store files as a stream of bytes with no structure. The operating system simply reads or writes as many bytes as the application program tells it to, without regard to record boundaries. In fact, the operating system doesn't even know what the record size is; it regards a file only as a collection of bytes.
It's up to the application program to handle the structure -- that is, to separate data into records. Typically the application program will make an I/O request to the operating system which specifies the number of bytes the OS is to return, and the application will then treat that data as one complete record.
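A minimal Python sketch of this idea follows. The application, not the operating system, decides that every group of record_length bytes is one record. The function name is hypothetical.

```python
import io
from typing import BinaryIO, Iterator

def read_records(stream: BinaryIO, record_length: int) -> Iterator[bytes]:
    """Yield fixed-length records by asking for record_length bytes at a time."""
    while True:
        record = stream.read(record_length)
        if not record:
            break                      # end of file
        if len(record) != record_length:
            raise ValueError("file ends with a partial record")
        yield record

# Demo: the OS sees 12 bytes; the application treats them as three 4-byte records.
demo_records = list(read_records(io.BytesIO(b"AAAABBBBCCCC"), 4))
```

If the wrong record length is supplied, the same bytes are silently carved into different "records", which is why the layout documentation matters.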
Notice that from the operating system point of view there is no information with the data file which specifies the record structure (although many applications will embed that information within the file itself). In general, you can't determine the record structure from the disk file; you need separate documentation for that. However, PC files commonly delimit records with a CR-LF, and UNIX computers normally delimit records with a Newline (a LF), and those can be used to determine the record size if there is no other documentation. But because there is no record structure imposed by the OS, there is nothing to prevent shorter or longer records within the file. You normally have to scan the entire file to be sure the records are all the same size.
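Scanning a delimited file to check the record sizes might look like the sketch below. It assumes records end in either CR-LF or a bare LF, and the function names are hypothetical.

```python
from collections import Counter

def record_length_profile(data: bytes) -> Counter:
    """Count how many records of each length occur in a delimited file."""
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()                    # a trailing delimiter leaves an empty piece
    lengths = Counter()
    for line in lines:
        if line.endswith(b"\r"):
            line = line[:-1]           # drop the CR of a CR-LF pair
        lengths[len(line)] += 1
    return lengths

def is_fixed_length(data: bytes) -> bool:
    """True only if every record in the file has the same length."""
    return len(record_length_profile(data)) == 1
```

A profile with more than one length does not necessarily mean the file is bad; it may simply contain variable-length records.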
The topic of file content could occupy many articles. We will briefly suggest a few issues.
File content obviously refers to the data within the file, but that also has different meanings. It can mean the code set, record layout, data types, or the variable data content of each record. IBM mainframe and AS/400 computers encode the alphabet using the EBCDIC code set, while most other computers, including the IBM PC, use ASCII coding. So a simple character field on a mainframe cannot be used on a PC without an EBCDIC to ASCII conversion. Furthermore, the layout you receive with the tape will seldom specify which character set is used. That's assumed from the operating system. So the COBOL field: 05 NAME PIC X(30). will contain EBCDIC characters if it originates on a mainframe, and ASCII characters if it originates on a PC. But the layout generally won't tell you that.
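For a plain character field such as the one above, the conversion itself is simple once you know the code page. The sketch below uses Python's built-in cp037 codec. That choice is an assumption: cp037 is only one of several EBCDIC code pages, and the right one depends on the source system. The function name is hypothetical.

```python
def ebcdic_to_ascii(field: bytes, codec: str = "cp037") -> str:
    """Translate an EBCDIC character field to text (cp037 is an assumed code page)."""
    return field.decode(codec)

# Demo: a 30-byte NAME field as it might arrive from a mainframe, padded with
# EBCDIC spaces (0x40 rather than the ASCII 0x20).
name_field = "JOHN SMITH".ljust(30).encode("cp037")
name_text = ebcdic_to_ascii(name_field).rstrip()
```

This translation should be applied only to character fields. Running it over a whole record would corrupt any binary or packed numeric fields the record contains.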
Binary numeric fields are common in mainframe data, but less common on PCs, which tend to store numbers as characters. Even when binary is used on a PC, the binary data type is not the same as binary on a mainframe. PC applications can seldom understand a mainframe binary field, and may often just return the wrong value without reporting an error.
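One common difference is byte order: IBM mainframes store binary integers most-significant byte first (big-endian), while Intel-based PCs store them least-significant byte first (little-endian). The short Python sketch below shows how the same four bytes yield the right value when read one way and a wrong value, with no error, when read the other way. The function name is hypothetical.

```python
def read_mainframe_binary(field: bytes) -> int:
    """Interpret a mainframe binary field as a signed big-endian integer."""
    return int.from_bytes(field, byteorder="big", signed=True)

raw = bytes([0x00, 0x00, 0x01, 0x2C])          # the value 300 as written by a mainframe

correct = read_mainframe_binary(raw)            # read in mainframe byte order
wrong = int.from_bytes(raw, byteorder="little", signed=True)  # read in PC byte order
```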
So what's considered a "standard" file type on one computer platform is not the same as a "standard" file on another platform. For example, Mainframe computers almost exclusively use fixed length records with no record delimiters, whereas Windows systems often use variable length records, and almost always use CR-LF record delimiters, even on fixed length records. Macintosh computers seldom use fixed length records, preferring variable length records with a CR record delimiter.
While "Data Conversion" generally refers to the total process of converting both the media and the data, "Media Conversion" is the term generally used when you only need to change the media, while leaving the tape format, file type, and file content unchanged. If you have the right operating system and tape program to read the tape, and an application program that can use the files, but you just don't have the right tape drive, then a media conversion is probably all you need.
Be aware, though, that even the same tape program may not write the same way to all types of tapes. For example, Arcserve, BackupExec, and others write slightly differently to a DLT tape than they do to a 4mm DDS tape, and simply copying from one to the other does not always work.
To properly convert data between different computers (a source and a destination) you need to know the type and density of the source tape, and understand the tape file format, the data file type, and the specific file contents. You also need to know the same information about the destination system. And, of course, you need the appropriate equipment and tools to perform the conversion.
A Conversion Example
Let's consider converting a mainframe COBOL file containing hospital medical records to a PC for import into Access. COBOL files such as this often contain multiple record types, so let's consider such an example. In our example let's say we have two record types: the first record for each patient is a master record containing their name and address and related account information, and successive records are the treatments they received while in the hospital. There will likely be several treatment records for each patient, and the number of treatment records will vary by patient. The mainframe COBOL file will be in EBCDIC, and likely contain comp (binary) or comp-3 (packed decimal) numeric fields, and IBM Signed fields. For simplicity we will assume all records are the same length.
Such a COBOL file might be written to a 3590 mainframe tape. Let's say this is a 3590E tape, in IBM SL (Standard-Label) FB (Fixed-Block) format. We want to get this data into Microsoft Access and write the data to a DVD for a PC. There are several possible ways to accomplish this conversion. For our example, let's say we have an IBM tape drive connected to a PC, and have the necessary tools on the PC.
Access cannot read EBCDIC, so the EBCDIC characters will need to be converted to ASCII. Access also cannot read comp or comp-3 numeric fields or IBM Signed fields, so they will need to be converted to ASCII numeric fields. And finally, Access cannot deal with multiple record types in a table, so each COBOL record type will need to be split-out to a separate Access table.
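To give a feel for the numeric conversions, here is a rough Python sketch of decoding two of these field types. It assumes that comp-3 means packed decimal (two digits per byte, with the sign in the last half-byte) and that "IBM Signed" means zoned decimal with the sign carried in the upper half of the last byte. The function names are hypothetical, and implied decimal points (the V in a COBOL picture) are not handled.

```python
NEGATIVE_SIGNS = (0x0B, 0x0D)   # sign codes conventionally treated as negative

def unpack_comp3(field: bytes) -> int:
    """Decode a packed-decimal (comp-3) field to an integer."""
    if not field:
        raise ValueError("empty field")
    digits = []
    for byte in field[:-1]:
        digits.append(byte >> 4)        # high half-byte is one digit
        digits.append(byte & 0x0F)      # low half-byte is the next digit
    digits.append(field[-1] >> 4)       # last byte: one digit plus the sign
    sign = field[-1] & 0x0F
    value = 0
    for digit in digits:
        if digit > 9:
            raise ValueError("invalid packed-decimal digit")
        value = value * 10 + digit
    return -value if sign in NEGATIVE_SIGNS else value

def unpack_zoned(field: bytes) -> int:
    """Decode a zoned-decimal field whose last byte carries the sign."""
    if not field:
        raise ValueError("empty field")
    value = 0
    for byte in field:
        value = value * 10 + (byte & 0x0F)   # low half-byte holds the digit
    sign = field[-1] >> 4                     # high half of last byte is the sign
    return -value if sign in NEGATIVE_SIGNS else value
```

Once decoded, the value can be written out as ordinary ASCII digits with an explicit sign, which is why the converted fields usually change in size.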
The first step is to read the tape on an IBM 3590E tape drive connected to the PC, using software that reads IBM SL FB tapes. The next step is to write a COBOL program to convert the EBCDIC characters to ASCII and convert the binary and IBM Signed fields to ASCII numeric fields, and to write each record type to a separate file. We will need a record layout to do this programming.
When multiple record types are contained in a single COBOL file, they are often associated by their relative position. In our example the first record for each patient is a master record containing their name, address, and related information, and successive records are the treatments they received while in the hospital. In the COBOL file you will know that the treatment records are associated with the patient's master record because they follow that record in the file. But Access can't handle such a file, so we will have to place each record type in a separate table. But doing so will lose the association between the patient's master record and their treatments, so we will have to add a key so Access can relate them.
So our COBOL program will read a patient master record from the mainframe file, convert the EBCDIC characters to ASCII, convert the comp and comp-3 fields to numeric ASCII, convert the IBM Signed fields to numeric ASCII, assign that patient a unique key, and write the converted record and the key to the master file, appending a CR-LF to the end of the record, as required by Access. Our program will then loop through all the treatment records for that patient, converting EBCDIC, binary, and Signed fields to ASCII, and writing each record to the treatment file using the same key. When the program encounters a new master record for a new patient, it will assign a new key for that patient and repeat the process.
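The control flow of such a program can be outlined in Python, although the real work would be done in COBOL as described. Several things in this sketch are assumptions made purely for illustration: the record type is taken from a one-character code in the first byte ("M" for master, "T" for treatment), the key is a simple six-digit counter, the code page is cp037, and every field is treated as text, so the numeric conversions are left out.

```python
def split_record_types(records, codec="cp037"):
    """Split a mixed file into master and treatment lines linked by a key."""
    master_lines, treatment_lines = [], []
    key = 0
    for record in records:
        record_type = record[0:1].decode(codec)   # hypothetical type code
        body = record[1:].decode(codec)           # EBCDIC to text
        if record_type == "M":
            key += 1                              # new patient, new key
            master_lines.append(f"{key:06d}{body}\r\n")
        elif record_type == "T":
            if key == 0:
                raise ValueError("treatment record before any master record")
            treatment_lines.append(f"{key:06d}{body}\r\n")   # reuse patient's key
        else:
            raise ValueError(f"unknown record type {record_type!r}")
    return master_lines, treatment_lines

# Demo: five 20-byte EBCDIC records for two patients.
demo = [text.ljust(20).encode("cp037")
        for text in ["MJOHN SMITH", "TX-RAY", "TBLOOD TEST", "MJANE DOE", "TMRI"]]
masters, treatments = split_record_types(demo)
```

Each output line begins with the key, so the treatment lines keep their link to the patient even after the two record types are written to separate files.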
There would likely be some other processing performed by our COBOL program, such as removing filler fields, testing for invalid data, dealing with redefined fields, etc.
To import this into Access, we would create two tables, one for the patient master record and one for treatments. We would define the record layout of each table using the layout for the converted file, and import the data. (The layout for the converted file will be different from that of the original file because the binary and Signed fields will change in size.) We would associate the two tables via the key we added to each file. The final Access database would be written to a DVD and delivered to the client.
Terms used in this example (comp, comp-3, IBM Signed, EBCDIC, etc.) are explained in our TechTalk technical articles, a few of which are listed below. We have also published a 7-part series on reading COBOL layouts. Please see "Additional Information" below.
Additional Information
Mainframe Tape Details: A detailed description of physical and logical recording on mainframe tapes.
Mainframe Tape Terminology: A brief overview of mainframe tapes, and definition of terms.
Mainframe Data Types: Discusses mainframe data types.
Character, Binary, and BCD Fields: Explains the three field types.
Understanding Record Size and Record Delimiters: Discusses differences between mainframe records and PC records.
Converting IBM Mainframe Tape Files to PCs: Discusses some practical considerations when converting mainframe tapes to PC.
Reading COBOL Layouts: 7-part series that explains how to read COBOL layouts.
For more articles on data conversion, see our TechTalk Index.
For information on our data conversion services, see Mainframe & AS/400 Conversion to PC and Our COBOL Conversion Services.
Disc Interchange Service
Company, Inc.
Media Conversion Specialists
15 Stony Brook Road
Westford, MA 01886