DBMS

1. DATABASE MANAGEMENT SYSTEM


1.2 Advantages of Database Systems

As displayed in the image below, the DBMS is a central system that acts as an interface between the data and the various front-end programs in the application. It also provides a central location for the entire data in the application to reside.

Due to its centralized nature, the database system can overcome the disadvantages of the file-based system as discussed below:

Minimal Data Redundancy

The whole data resides in one central database. This makes it possible for the various programs in the application to access the data present in the different data files. Hence, the data present in one file need not be duplicated in another. This reduces data redundancy. However, this does not mean that redundancy can be eliminated completely. There could be business or technical reasons for having some amount of redundancy. Any such redundancy should be carefully controlled and the DBMS should be aware of it.

Data Consistency

Reduced data redundancy leads to better data consistency.

Data Integration

The related data is stored in one single database. This makes the enforcement of data integrity much easier. Moreover, the functions in the DBMS can be used to enforce the integrity rules with minimum programming in the application programs.

Data Sharing

The data is stored in a centralized manner. So, the related data can be shared across programs. This makes it possible to even develop new applications to operate against the same data.

Enforcement of Standards

Enforcing standards in the organization and structure of data files is both necessary and easy in a database system, since a single set of programs always interacts with the data files.

Application Development Ease

The application programmer need not build the functions for handling issues like concurrent access, security, data integrity, etc. The programmer only needs to implement the application business rules. This makes application development easy. Adding additional functional modules is also easier than in file-based systems.

Better Controls

Better controls can be achieved due to the centralized nature of the system.

Data Independence

The architecture of the DBMS can be viewed as a 3-level system comprising the following:
The internal or physical level, where the data resides.
The conceptual level, or the level of the DBMS functions.
The external level, or the level of the application programs or the end user.


Data Independence helps to isolate an upper level from changes in the organization or structure of a lower level. For example, if changes in the file organization of a data file do not demand changes in the functions of the DBMS or in the application programs, data independence is achieved. Thus, Data Independence can be defined as the immunity of applications to changes in physical representation and access technique. The provision of data independence is a major objective of database systems.
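The separation described above can be sketched in a few lines of Python. This is a hedged, hypothetical illustration (the class and method names are invented for this sketch, not part of any real DBMS): the application talks only to a stable interface, so the physical storage engine can be swapped without touching application code.

```python
class RowStorage:                      # internal level: one physical layout
    def __init__(self):
        self._rows = {}
    def put(self, key, record):
        self._rows[key] = record
    def get(self, key):
        return self._rows[key]

class SortedStorage(RowStorage):       # a different physical layout
    def get(self, key):
        # imagine a B-tree lookup here; the interface is unchanged
        return self._rows[key]

class Database:                        # conceptual level: stable interface
    def __init__(self, storage):
        self._storage = storage
    def insert(self, key, record):
        self._storage.put(key, record)
    def fetch(self, key):
        return self._storage.get(key)

# External level: the application works identically with either storage engine.
for engine in (RowStorage(), SortedStorage()):
    db = Database(engine)
    db.insert(10001, {"name": "Raj", "city": "Delhi"})
    print(db.fetch(10001)["city"])     # Delhi, both times
```

The application loop at the bottom never mentions the storage layout, which is exactly the immunity to physical representation that the definition above describes.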

Reduced Maintenance

Maintenance is reduced and easier, due to the centralized nature of the system.



1.3 Functions of a DBMS

The functions performed by a typical DBMS are the following:
Data Definition

The DBMS provides functions to define the structure of the data in the application. These include defining and modifying the record structure, the type and size of fields and the various constraints/conditions to be satisfied by the data in each field.
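As a concrete sketch of data definition, the following uses Python's built-in sqlite3 module. The table name and columns are borrowed from the CUSTOMER example used later in this text; the exact constraints shown are illustrative assumptions, not prescribed by it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data definition: the record structure, the type of each field and the
# constraints/conditions the data in each field must satisfy.
conn.execute("""
    CREATE TABLE customer (
        cust_id   INTEGER PRIMARY KEY,   -- uniqueness constraint
        cust_name TEXT NOT NULL,         -- mandatory field
        cust_city TEXT,
        CHECK (cust_id > 0)              -- condition on the field's values
    )
""")
```

Once defined, the DBMS itself rejects any row that violates these rules, so the checks need not be re-coded in every application program.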

Data Manipulation

Once the data structure is defined, data needs to be inserted, modified or deleted. The functions which perform these operations are also part of the DBMS. These functions can handle planned and unplanned data manipulation needs. Planned queries are those which form part of the application. Unplanned queries are ad-hoc queries which are performed on a need basis.
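The insert/modify/delete operations can be sketched with sqlite3 as well. This is a minimal, self-contained example (the sample rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY,"
    " cust_name TEXT, cust_city TEXT)"
)

# Planned manipulation: inserts, updates and deletes built into the application.
conn.execute("INSERT INTO customer VALUES (10001, 'Raj', 'Delhi')")
conn.execute("INSERT INTO customer VALUES (10002, 'Meena', 'Mumbai')")
conn.execute("UPDATE customer SET cust_city = 'Pune' WHERE cust_id = 10002")
conn.execute("DELETE FROM customer WHERE cust_id = 10001")

# An unplanned (ad-hoc) query, run on a need basis.
rows = conn.execute("SELECT cust_name, cust_city FROM customer").fetchall()
print(rows)   # [('Meena', 'Pune')]
```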

Data Security & Integrity

The DBMS contains functions which handle the security and integrity of data in the application. These can be easily invoked by the application and hence the application programmer need not code these functions in his/her programs.

Data Recovery & Concurrency

Recovery of data after a system failure and concurrent access of records by multiple users are also handled by the DBMS.

Data Dictionary Maintenance

Maintaining the Data Dictionary which contains the data definition of the application is also one of the functions of a DBMS.

Performance

Optimizing the performance of the queries is one of the important functions of a DBMS. Hence the DBMS has a set of programs forming the Query Optimizer which evaluates the different implementations of a query and chooses the best among them.

Thus the DBMS provides an environment that is both convenient and efficient to use when there is a large volume of data and many transactions to be processed.

1.4 Role of the Database Administrator

Every DBMS would typically have three types of users. They are:

1. The End Users, who use the application. Ultimately, these are the users who actually put the data in the system to use in business. The end users might not be aware of anything regarding the organization of data at the physical level. They might also not be aware of the entire data available in the system. They need access to, and knowledge of, only the data they use.

2. The Application Programmers, who develop the application programs. They have better knowledge about the data and its structure, since they have to manipulate the data using their programs. They too need not have access to, or knowledge of, the entire data available in the system.

3. The Database Administrators (DBAs), who are the super-users of the system. The role of the DBAs is very important and is defined by the following functions.

Defining the Schema:
The DBAs define the schema which contains the structure of the data in the application. The DBAs determine what data needs to be present in the system and how that data has to be represented and organized.

Liaising with Users:
The DBAs would need to interact continuously with the users to understand the data in the system and its use.

Defining Security & Integrity Checks:
The DBAs keep a check on the access restrictions to be defined and then define the security checks accordingly. Data Integrity checks are also defined by the DBAs.

Defining Backup / Recovery Procedures:
The DBAs also define the procedures for backup and recovery. Defining the backup procedures include specifying what data is to be backed up, the periodicity of taking backups and also the medium and storage place for the backup data.

Monitoring Performance:
The DBAs have to continuously monitor the performance of the queries and take appropriate measures to optimize all the queries in the application.

1.5 Types of Database Systems

Database systems can be categorised based on the data structures and operators they present to the user. The oldest systems fall into three categories: inverted list, hierarchical and network systems. These are the pre-relational models.
In the Hierarchical Model, different records are inter-related through hierarchical or tree-like structures. A parent record can have several children, but a child can have only one parent. In the figure, there are two hierarchies shown - the first storing the relations between CUSTOMER, ORDERS, CONTACTS and ORDER_PARTS and the second showing the relation between PARTS, ORDER_PARTS and SALES_HISTORY. The many-to-many relationship is implemented through the ORDER_PARTS segment which occurs in both the hierarchies. In practice, only one tree stores the ORDER_PARTS segment, while the other has a logical pointer to this segment. IMS (Information Management System) of IBM is an example of a Hierarchical DBMS.


In the Network Model, a parent can have several children and a child can also have many parent records. Records are physically linked through linked-lists. IDMS from Computer Associates International Inc. is an example of a Network DBMS.


In the Relational Model, unlike the Hierarchical and Network models, there are no physical links. All data is maintained in the form of tables consisting of rows and columns. Data in two tables is related through common columns, not through physical links or pointers. Operators are provided for operating on rows in tables. Unlike the other two types of DBMS, there is no need to traverse pointers in a Relational DBMS. This makes querying much easier in a Relational DBMS than in a Hierarchical or Network DBMS. This, in fact, is a major reason why the relational model became more programmer-friendly and far more dominant and popular in both industrial and academic settings. Oracle, Sybase, DB2, Ingres, Informix and MS-SQL Server are a few of the popular Relational DBMSs.
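The "common columns, not pointers" idea can be sketched with sqlite3, reusing the CUSTOMER and ORDERS data from the tables below. The schema here is a simplified assumption for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_no INTEGER PRIMARY KEY, cust_name TEXT);
    CREATE TABLE orders   (order_no INTEGER PRIMARY KEY, order_date TEXT,
                           cust_no INTEGER REFERENCES customer);
    INSERT INTO customer VALUES (15371, 'Nanubhai & Sons');
    INSERT INTO orders   VALUES (3216, '24-June-1997', 15371);
""")

# The two tables are related only through the common CUST_NO column;
# the query engine matches values, and no pointer is ever traversed.
row = conn.execute("""
    SELECT c.cust_name, o.order_no
    FROM customer c JOIN orders o ON c.cust_no = o.cust_no
""").fetchone()
print(row)   # ('Nanubhai & Sons', 3216)
```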


CUSTOMER
CUST NO.   CUSTOMER NAME     ADDRESS      CITY
15371      Nanubhai & Sons   L. J. Road   Mumbai
...        ...               ...          ...

CONTACTS
CUST NO.   CONTACT        DESIGNATION
15371      Nanubhai       Owner
15371      Rajesh Munim   Accountant
...        ...            ...

ORDERS
ORDER NO.  ORDER DATE     CUSTOMER NO.
3216       24-June-1997   15371
...        ...            ...

PARTS
PART NO.   PART DESC               PART PRICE
S3         Amkette 3.5" Floppies   400.00
...        ...                     ...

ORDER_PARTS
ORDER NO.  PART NO.   QUANTITY
3216       C1         300
3216       S3         120
...        ...        ...

SALES_HISTORY
PART NO.   REGION   YEAR   UNITS
S3         East     1996    2000
S3         North    1996    5500
S3         South    1996   12000
S3         West     1996   20000

The recent developments in the area have shown up in the form of certain object and object/relational DBMS products. Examples of such systems are GemStone and Versant ODBMS. Research has also proceeded on to a variety of other schemes including the multi-dimensional approach and the logic-based approach.


The External Level represents the collection of views available to different end users.
The Conceptual Level is the representation of the entire information content of the database.
The Internal Level is the physical level, which shows how the data is stored, what the representations of the fields are, etc.



 2. Advanced Database Concepts

While dealing with the physical storage of data, the main objective is to optimize performance by minimizing the number of disk accesses during the various database operations.


The image above displays the process of database access in general. The DBMS considers the database as a collection of records. The File Manager of the underlying Operating System considers it as a set of pages, and the Disk Manager considers it as a collection of physical locations on the disk.

When the DBMS makes a request for a specific record to the File Manager, the latter maps the record to a page containing it and requests the Disk Manager for the specific page. The Disk Manager determines the physical location on the disk and retrieves the required page.

2.1 Clustering

In this process, if the page containing the requested record is already present in memory, retrieval from the disk is not necessary, so the whole operation takes less time. Thus, if records that are frequently used together are placed physically together, more of them will fall in the same page. Hence fewer pages need to be retrieved, which reduces the number of disk accesses and leads to better performance.

This method of storing logically related records, physically together is called clustering.

Eg: Consider CUSTOMER table as shown below.

Cust ID   Cust Name   Cust City   ...
10001     Raj         Delhi       ...
10002     ...         ...         ...
10003     ...         ...         ...
10004     ...         ...         ...
...       ...         ...         ...
If queries that retrieve Customers with consecutive Cust_IDs occur frequently in the application, clustering based on Cust_ID will help to improve the performance of these queries. This can be explained as follows:

Assume that the Customer record size is 128 bytes and the typical size of a page retrieved by the File Manager is 1 KB (1024 bytes).

If there is no clustering, it can be assumed that the Customer records are stored at random physical locations. In the worst-case scenario, each record may be placed in a different page. Hence a query to retrieve 100 records with consecutive Cust_IDs (say, 10001 to 10100) will need to access 100 pages. This translates into 100 disk accesses.

But if the records are clustered, a page can contain 8 records. Hence the number of pages to be accessed for retrieving the 100 consecutive records will be ceil(100/8) = 13, i.e., only 13 disk accesses will be required to obtain the query results. This shows that clustering improves the speed by a factor of about 7.7.
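The arithmetic above is easy to check directly. The figures (128-byte records, 1024-byte pages, 100 consecutive records) come from the example in the text:

```python
import math

record_size = 128        # bytes per CUSTOMER record
page_size = 1024         # bytes per page fetched by the File Manager
records = 100            # consecutive Cust_IDs 10001..10100

records_per_page = page_size // record_size                 # 8
unclustered_accesses = records                              # worst case: 1 page per record
clustered_accesses = math.ceil(records / records_per_page)  # ceil(100/8) = 13

print(records_per_page, clustered_accesses)                 # 8 13
print(round(unclustered_accesses / clustered_accesses, 1))  # 7.7
```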

Q:
For what record size will clustering be of no benefit to improve performance?
A: When the record size and page size are such that a page can contain only one record.

Q:
Can a table have clustering on multiple fields simultaneously?
A: No

Intra-file Clustering:
Clustered records belong to the same file (table) as in the above example.

Inter-file Clustering:
Clustered records belong to different files (tables). This type of clustering may be required to enhance the speed of queries retrieving related records from more than one table. Here interleaving of records is used.


2.2 Indexing

Indexing is a common method for faster retrievals.

Consider the example of CUSTOMER table used above. The following query is based on the Customer's city.

"Retrieve the records of all customers who reside in Delhi"

Here, a sequential search on the CUSTOMER table has to be carried out and all records with the value 'Delhi' in the Cust_City field have to be retrieved. The time taken for this operation depends on the number of pages to be accessed. If the records are stored at random, the number of page accesses depends on the volume of data; if the records are stored physically together, it also depends on the size of each record.

If such queries based on Cust_City field are very frequent in the application, you can take steps to improve the performance of these queries. One such method is to create an index on Cust_City. This would result in the following scenario.


A new index file is created. The number of records in the index file is the same as that in the data file. Each record in the index file has two fields: one contains the value of the Cust_City field and the other contains a pointer to the actual data record in the CUSTOMER table.

Whenever a query based on the Cust_City field occurs, a search is carried out on the index file. This search is much faster than a sequential search on the CUSTOMER table, because each index record is much smaller, so each page can hold far more index records and fewer pages need to be read.

When the records with value 'Delhi' in the Cust_City field are located in the index file, the pointer in the second field of the records can be followed to directly retrieve the corresponding CUSTOMER records.

This type of access involves a Sequential access on the index file and a Direct access on the actual data file.
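The mechanism can be sketched as a toy in-memory structure. This is not how a real DBMS stores an index (real systems use B-trees on disk); it only illustrates the value-to-pointer mapping and the direct access that follows. The sample records are invented:

```python
# The "data file": a list of records, where a list position stands in for
# the physical address of a record.
data_file = [
    {"cust_id": 10001, "cust_name": "Raj",   "cust_city": "Delhi"},
    {"cust_id": 10002, "cust_name": "Meena", "cust_city": "Mumbai"},
    {"cust_id": 10003, "cust_name": "Anil",  "cust_city": "Delhi"},
]

# Build the index: one (value, pointer) entry per data record.
city_index = {}
for position, record in enumerate(data_file):
    city_index.setdefault(record["cust_city"], []).append(position)

# "Retrieve the records of all customers who reside in Delhi":
# scan the small index, then follow the pointers directly into the data file.
delhi_customers = [data_file[p] for p in city_index.get("Delhi", [])]
print([r["cust_id"] for r in delhi_customers])   # [10001, 10003]
```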

Retrieval Speed v/s Update Speed : Though indexes help make retrievals faster, they slow down updates on the table. This is because updates to the base table demand updates on the index as well.

It is possible to create an index on multiple fields, i.e., an index on a combination of fields. Multiple indexes can also exist on the same table simultaneously, though there may be a limit on the maximum number of indexes that can be created on a table.


Q:
Among the following situations, identify the ones where indexes will be ineffective:

a) When the percentage of rows being retrieved is large
b) When the data table is small and the index record is almost the same size as the actual data record
c) In queries involving NULL / NOT NULL in the indexed field
d) All of the above

A:
d) All of the above

Q: Can a clustering based on one field and indexing on another field exist on the same table simultaneously?
A: Yes




2.3 Hashing

Hashing is yet another method used to make retrievals faster. This method provides direct access to a record on the basis of the value of a specific field called the hash field. When a new record is inserted, it is physically stored at an address computed by applying a mathematical function (the hash function) to the value of the hash field. Thus, for every new record,

hash_address = f (hash_field), where f is the hash function.

Later, when a record is to be retrieved, the same hash function is used to compute the address where the record is stored. Retrievals are faster since a direct access is provided and there is no search involved in the process.

A typical hash function is, for example, a numeric hash field (say, an ID) modulo a very large prime number.

Q:
Can there be more than one hash field on a file?
A:
No


As hashing relates the field value to the address of the record, multiple hash fields will map a record to multiple addresses at the same time. Hence there can be only one hash field per file.

Collisions : Consider the example of the CUSTOMER table given earlier while discussing clustering. Let CUST_ID be the hash field and the hash function be defined as ((CUST_ID mod 10000)*64 + 1025). The records with CUST_ID 10001, 10002, 10003 etc. will be stored at addresses 1089, 1153, 1217 etc. respectively.

It is possible that two records hash to the same address leading to a collision. In the above example, records with CUST_ID values 20001, 20002, 20003 etc. will also map on to the addresses 1089, 1153, 1217 etc. respectively. And same is the case with CUST_ID values 30001, 30002, 30003 etc.
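The hash function and the collisions described above can be verified directly:

```python
# The hash function from the text: hash_address = (CUST_ID mod 10000) * 64 + 1025
def hash_address(cust_id):
    return (cust_id % 10000) * 64 + 1025

print([hash_address(c) for c in (10001, 10002, 10003)])   # [1089, 1153, 1217]

# Collision: CUST_IDs 10001, 20001 and 30001 all hash to the same address.
print(hash_address(20001), hash_address(30001))           # 1089 1089
```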

The methods to resolve a collision are by using :

1. Linear Search:
While inserting a new record, if it is found that the location at the hash address is already occupied by a previously inserted record, search for the next free location available on the disk and store the new record at this location. A pointer from the first record at the original hash address to the new record is also stored. During retrieval, the hash address is computed to locate the record. If the record is not found at the hash address, the pointer from the record at that address is followed to locate the required record.

In this method, the overhead incurred is the time taken by the linear search to locate the next free location while inserting a record.

2. Collision Chain:
Here, the hash address location contains the head of a list of pointers linking together all records which hash to that address.


In this method, an overflow area needs to be used if the number of records mapping on to the same hash address exceeds the number of locations linked to it.
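A collision chain can be sketched as a dictionary of lists, with each hash address holding the chain of records that hash to it. This is an in-memory toy (real systems chain disk locations, with an overflow area as noted above):

```python
# The hash function from the earlier example.
def hash_address(cust_id):
    return (cust_id % 10000) * 64 + 1025

buckets = {}   # hash address -> chain of all records hashing to that address

def insert(record):
    buckets.setdefault(hash_address(record["cust_id"]), []).append(record)

def fetch(cust_id):
    # Compute the address, then walk the chain to resolve any collision.
    for record in buckets.get(hash_address(cust_id), []):
        if record["cust_id"] == cust_id:
            return record
    return None

insert({"cust_id": 10001, "cust_name": "Raj"})
insert({"cust_id": 20001, "cust_name": "Meena"})   # collides with 10001

print(fetch(20001)["cust_name"])   # Meena
```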


