Dear readers, welcome to Ab Initio Interview Questions and Answers. These questions have been designed to acquaint you with the kind of questions you may encounter in a job interview on Ab Initio. They are useful for campus placement tests and job interviews. In my experience, good interviewers rarely plan to ask any particular question during an interview, but model questions like these are asked in the online technical tests and interviews of many IT companies.
Rollup cannot generate cumulative summary records; for that we use Scan.
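The difference can be sketched outside Ab Initio. The following Python is only an illustration of the idea, not Ab Initio code: the sample records and the names rollup_totals and scan_totals are made up for this example, and the input is assumed to be sorted by the grouping key.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample records, already sorted by the grouping key.
records = [
    {"dept": "A", "amount": 10},
    {"dept": "A", "amount": 20},
    {"dept": "B", "amount": 5},
    {"dept": "B", "amount": 15},
]


def rollup_totals(recs, key, field):
    """Rollup-style summary: one output record per key group."""
    return [
        {key: k, "total": sum(r[field] for r in group)}
        for k, group in groupby(recs, key=itemgetter(key))
    ]


def scan_totals(recs, key, field):
    """Scan-style summary: one output per input record, with a running total."""
    out = []
    for k, group in groupby(recs, key=itemgetter(key)):
        running = 0
        for r in group:
            running += r[field]
            out.append({key: k, "running_total": running})
    return out
```

The rollup-style function returns one record per group, while the scan-style function returns one record per input record carrying the cumulative value so far.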
PARTITION BY KEY:
In this, we specify the key on which the partitioning is based. Records with the same key value go to the same partition, so the data is well balanced when the key values are evenly distributed. It is useful for key-dependent parallelism.
PARTITION BY ROUND ROBIN:
In this, the records are partitioned sequentially, distributing data evenly in blocksize chunks across the output partitions. It is not key based and results in well-balanced data, especially with a blocksize of 1. It is useful for record-independent parallelism.
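The two strategies can be sketched in Python as follows. This is an illustration only and does not show how Ab Initio implements its partitioners: the function names are hypothetical and the use of crc32 as the hash is an assumption made for the example.

```python
from zlib import crc32


def partition_by_key(records, key, n_partitions):
    """Send each record to a partition chosen by hashing its key value."""
    parts = [[] for _ in range(n_partitions)]
    for rec in records:
        # crc32 is an arbitrary stable hash chosen for this sketch.
        idx = crc32(str(rec[key]).encode("utf-8")) % n_partitions
        parts[idx].append(rec)
    return parts


def partition_round_robin(records, n_partitions, block_size=1):
    """Deal records out in block_size chunks, one partition after another."""
    parts = [[] for _ in range(n_partitions)]
    for i, rec in enumerate(records):
        parts[(i // block_size) % n_partitions].append(rec)
    return parts
```

With partition_by_key, all records sharing a key value land in the same partition, which is what key-dependent operations need. With partition_round_robin, the partition sizes differ by at most one block regardless of the data.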
There are many ways to do it.
1. Probably the easiest way is to use the Truncate Table component.
2. Run SQL or Update Table can be used to do the same thing.
3. Run Program.
A .dbc file holds the information Ab Initio needs to connect to a database in order to extract or load tables or views, while a .cfg file is the table configuration file created by db_config when using components such as Load DB Table.
There are 3 types of parallelism in Ab Initio.
1) Data Parallelism:
Data is processed on different servers at the same time.
2) Pipeline parallelism:
In this, the records are processed in a pipeline, i.e. a component does not have to wait for all the records to be processed. Records that have been processed are passed on to the next component in the pipeline.
3) Component Parallelism:
In this, two or more components process records in parallel.
Component parallelism :- A graph with multiple processes running simultaneously on separate data uses component parallelism.
Data parallelism :- A graph that deals with data divided into segments and operates on each segment simultaneously uses data parallelism. Nearly all commercial data processing tasks can use data parallelism. To support this form of parallelism, Ab Initio provides Partition components to segment data, and Departition components to merge segmented data back together.
Pipeline parallelism :- A graph with multiple components running simultaneously on the same data uses pipeline parallelism. Each component in the pipeline continuously reads from upstream components, processes data, and writes to downstream components. Since a downstream component can process records previously written by an upstream component, both components can operate in parallel. NOTE: To limit the number of components running simultaneously, set phases in the graph.
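The record-at-a-time hand-off behind pipeline parallelism can be sketched with Python generators. This is an illustration only: generators interleave work in a single thread rather than running truly in parallel, and the stage names read_records, reformat and write_records are hypothetical, not Ab Initio components.

```python
def read_records(values, log):
    """Upstream stage: emits records one at a time."""
    for v in values:
        log.append(f"read {v}")
        yield v


def reformat(records, log):
    """Middle stage: processes each record as soon as it arrives."""
    for r in records:
        log.append(f"reformat {r}")
        yield r * 10


def write_records(records, log):
    """Downstream stage: consumes records as they are produced."""
    out = []
    for r in records:
        log.append(f"write {r}")
        out.append(r)
    return out


log = []
result = write_records(reformat(read_records([1, 2], log), log), log)
```

The log shows that the first record is read, reformatted and written before the second record is even read, so no stage waits for the whole input to be processed.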
To convert a string to a decimal, we typecast it using the following syntax:
out.decimal_field :: ( decimal( size_of_decimal ) ) string_field;
The above statement converts the string to a decimal and assigns it to the decimal field in the output.
There are many ways to do this; here is one example. You can run components in the order of the phases you have defined. You can also run them by creating ksh or sh scripts.
Data mapping deals with the transformation of the extracted data at FIELD level i.e. the transformation of the source field to target field is specified by the mapping defined on the target field. The data mapping is specified during the cleansing of the data to be loaded.
For Example:
source;
string(35) name = "Siva Krishna ";
target;
string("01") nm = NULL(""); /* (maximum length is string(35)) */
Then we can have a mapping like:
Straight move. Trim the leading and trailing spaces.
The above mapping specifies the transformation of the field nm.
Sandboxes are work areas used to develop, test or run code associated with a given project. Only one version of the code can be held within the sandbox at any time.
The EME Datastore contains all versions of the code that have been checked into it. A particular sandbox is associated with only one project, whereas a project can be checked out to a number of sandboxes.
Environment variables serve as global variables in the Unix environment. They are used for passing values from one shell or process to another. They are inherited by Ab Initio as sandbox variables or graph parameters, such as
AI_SORT_MAX_CORE
AI_HOME
AI_SERIAL
AI_MFS etc.
To find out which variables exist in your Unix shell, work out the naming convention and type a command like "env | grep AI". This gives you a list of all the matching variables set in the shell. You can refer to the graph parameters and components to see how these variables are used inside Ab Initio.
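The same lookup can be done from a script. The sketch below is a Python equivalent under one assumption: it keeps only names that start with AI_, which is narrower than "grep AI" (that matches AI anywhere in the line). The function name ai_variables is hypothetical.

```python
import os


def ai_variables(environ=None):
    """Return the environment variables whose names start with AI_, sorted by name."""
    env = os.environ if environ is None else environ
    return {
        name: value
        for name, value in sorted(env.items())
        if name.startswith("AI_")
    }


if __name__ == "__main__":
    for name, value in ai_variables().items():
        print(f"{name}={value}")
```

Passing a dictionary instead of relying on os.environ makes the function easy to try out without an Ab Initio environment.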
There are 2 types of graph parameters in Ab Initio:
1. Local parameters
2. Formal parameters (those whose values are supplied at runtime)
How to improve the performance of graphs in Ab Initio? Give some examples or tips.
Ans: There are many ways to improve the performance of graphs in Ab Initio. Here are a few points:
1. Use a multifile system (MFS) with Partition by Round-robin.
2. Where appropriate, use lookup_local rather than lookup when the data is large.
3. Take out unnecessary components such as Filter by Expression; instead, apply the filter in Reformat/Join/Rollup.
4. Use Gather instead of Concatenate.
5. Tune max-core for optimal performance.
6. Try to avoid too many phases.
Check point:
- A checkpoint is a recovery point: when a graph fails in the middle of processing, it can be restarted from the last checkpoint
- The rest of the process continues from the checkpoint
- After the problem is corrected, the data saved at the checkpoint is fetched and execution continues from there
Phase:
- If a graph is created with phases, each phase in turn is assigned its share of memory
- The phases run one after another
- The intermediate files are deleted
- A graph or program hang is known as a deadlock
- The progress of a program stops when a deadlock occurs
- Certain data flow patterns are likely to cause a deadlock
- If the flows of a graph diverge and converge within a single phase, there is potential for a deadlock
- Where the flows converge, a component might wait for records to arrive on one flow while unread data accumulates on the others
- In GDE version 1.8, the occurrence of a deadlock is very rare
EME:
- EME stands for Enterprise Metadata Environment
- It is the repository for Ab Initio. It holds transformations, database configuration files, metadata and target information
GDE:
- GDE – Graphical Development Environment
- It is an end user environment. Graphs are developed in this environment
- It provides GUI for editing and executing AbInitio programs
Co>Operating System:
- The Co>Operating System is the Ab Initio server
- It is installed on a specific OS platform, known as the native OS
- All graphs developed in the GDE are later deployed to and executed on the Co>Operating System
Ab Initio supports 3 kinds of parallelism. They are:
- Data parallelism: the data is divided into partitions and the partitions are processed simultaneously within a single application
- Component parallelism: different components work on different data simultaneously within a single application
- Pipeline parallelism: data is passed from one component to the next, and both components work on the data at the same time
Duplicate records can be avoided by using the following:
- Using the Dedup Sorted component
- Performing aggregation
- Utilizing the Rollup component
- max-core is the maximum amount of memory a component may use for its computations
- Different components have different max-core values
- A component's performance is influenced by its max-core setting
- The process may run faster or slower depending on the value; a badly chosen max-core slows it down
- first_defined() is similar to the NVL() function in Oracle (or, with more than two arguments, COALESCE())
- It returns the first value that is not NULL among the values passed to it, and that value is assigned to the variable
Example: A set of variables, say v1, v2, v3, v4, v5, v6, are all NULL.
Another variable, num, is assigned the value 340 (num = 340)
num = first_defined(NULL, v1, v2, v3, v4, v5, v6, num)
The result of num is 340
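The behaviour described above can be mimicked in Python, with None standing in for NULL. This is a sketch of the semantics as described in the text, not the Ab Initio implementation.

```python
def first_defined(*values):
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


# Mirrors the example above: v1..v6 are NULL and num is 340.
v1 = v2 = v3 = v4 = v5 = v6 = None
num = 340
num = first_defined(None, v1, v2, v3, v4, v5, v6, num)
```

Note that the check is for None specifically, so a legitimate value such as 0 or an empty string is still returned.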
- decimal_strip extracts the decimal value from a string, discarding non-numeric characters
- It trims any leading zeros
- The result is a valid decimal number
Ex:
decimal_strip("-0184o") := "-184"
decimal_strip("oxyas97abc") := "97"
decimal_strip("+$78ab=-*&^*&%cdw") := "78"
decimal_strip("Honda") := "0"
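The four examples above can be reproduced with a small Python function. This is a simplified sketch that was written only to match those examples: it ignores decimal points, and the rule that a minus sign counts only when it appears before the first digit is an assumption, not documented Ab Initio behaviour.

```python
DIGITS = "0123456789"


def decimal_strip_sketch(text):
    """Keep the digits of text, drop leading zeros, and keep a leading minus sign."""
    first = next((i for i, c in enumerate(text) if c in DIGITS), None)
    if first is None:
        return "0"  # no digits at all
    # Assumption: a '-' only counts if it appears before the first digit.
    negative = "-" in text[:first]
    digits = "".join(c for c in text if c in DIGITS).lstrip("0")
    if not digits:
        return "0"  # the string contained only zeros
    return ("-" if negative else "") + digits
```

For real data, the behaviour of the actual decimal_strip function should be checked against the Ab Initio documentation rather than inferred from this sketch.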
- PDL is used to make a graph behave dynamically
- Suppose a dynamic field needs to be added to a predefined DML while executing the graph
- A graph-level parameter can then be defined
- Use this parameter when embedding the DML in the output port
- For example: define a parameter named myfield with the value "string("|") name;"
- Use ${myfield} when embedding the DML in the out port
- Use $ substitution as the interpretation option
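The effect of $ substitution can be imitated with Python's string.Template, which also replaces ${name} placeholders. This is only an analogy, not how the GDE resolves parameters. It assumes that the parameter is called myfield and that its value is a pipe-delimited string field definition; the surrounding record format is made up for the example.

```python
from string import Template


def embed_dml(template_text, parameters):
    """Replace ${name} placeholders in a DML template with parameter values."""
    return Template(template_text).substitute(parameters)


# Hypothetical embedded DML with a placeholder for the dynamic field.
dml_template = "record\n  ${myfield}\nend;"

# Assumed parameter value: a string field delimited by "|".
resolved = embed_dml(dml_template, {"myfield": 'string("|") name;'})
```

After substitution the placeholder line becomes the field definition, so the record format seen at run time depends on the parameter value.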
Following is the order of evaluation:
- Host setup script will be executed first
- All common (that is, included) parameters are evaluated
- All Sandbox parameters are evaluated
- The project script – project-start.ksh is executed
- All formal parameters are evaluated
- Graph parameters are evaluated
- The Start Script of graph is executed
- Use a decimal cast with the size in the transform function when the string and the decimal have the same size
- Ex: Suppose the source field is defined as string(8)
- and the destination is defined as decimal(8)
- Let us assume the field name is salary
- The rule is out.field :: (decimal(8)) in.salary
- If the destination field is smaller than the input, the string_substring() function can be used
- Ex: Say the destination field is decimal(5); then use
- out.field :: (decimal(5)) string_lrtrim(string_substring(in.field, 1, 5))
- The string_lrtrim function removes leading and trailing spaces from the string
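The substring, trim and cast steps can be mirrored in Python for illustration. This is a sketch, not DML: the function name string_to_decimal is hypothetical, Python's Decimal stands in for the decimal type, and the 1-based start position with a length of 5 is assumed to correspond to the slice value[0:5].

```python
from decimal import Decimal


def string_to_decimal(value, width=None):
    """Optionally keep the first width characters, trim spaces, then convert."""
    if width is not None:
        value = value[:width]      # like string_substring(value, 1, width)
    return Decimal(value.strip())  # like string_lrtrim followed by the cast
```

For example, string_to_decimal("00012500") gives the same number as Decimal("12500"), and string_to_decimal(" 123456", 5) keeps " 1234" and converts it to 1234, which shows that truncating the string can silently change the value.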
The following are the ways to improve the performance of a graph :
- Use a limited number of components in any one phase
- Use optimal max-core values for sort and join components
- Use as few sort components as possible
- Use as few sorted join components as possible and, where possible, replace them with in-memory (hash) joins
- Keep only the required fields in sort, reformat and join components
- Use phasing or flow buffers with merges or sorted joins
- Use a sorted join when both inputs are huge; otherwise use a hash join
Multistage transform components use packages by default. However, users can create their own set of functions in a transform function and include it in other transform functions.
If the user wants to group records on particular field values, then rollup is the best way to do that. Rollup is a multistage transform and it contains the following mandatory functions.
1. initialize
2. rollup
3. finalize
You also need to declare a temporary variable if you want to get counts for a particular group.
For each group, it first calls the initialize function once, then calls the rollup function for each record in the group, and finally calls the finalize function once after the last rollup call.
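The call sequence described above can be sketched in Python. This is an illustration of the order of calls only, not Ab Initio code: the driver run_rollup is hypothetical, the records are assumed to be sorted on a field named key, and the temporary variable is modelled as a small dictionary holding a count.

```python
from itertools import groupby


def initialize(first_record):
    """Called once per group: set up the temporary variable."""
    return {"count": 0}


def rollup(temp, record):
    """Called once per record in the group: update the temporary variable."""
    temp["count"] += 1
    return temp


def finalize(temp, last_record):
    """Called once per group, after the last rollup call: build the output."""
    return {"key": last_record["key"], "count": temp["count"]}


def run_rollup(records):
    """Hypothetical driver that applies the three functions group by group."""
    out = []
    for _, group in groupby(records, key=lambda r: r["key"]):
        temp = None
        last = None
        for rec in group:
            if temp is None:
                temp = initialize(rec)
            temp = rollup(temp, rec)
            last = rec
        out.append(finalize(temp, last))
    return out
```

Running the driver on three records with keys a, a and b produces one output record per group, with counts 2 and 1.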
Add Default Rules: opens the Add Default Rules dialog. Select one of the following:
- Match Names: generates a set of rules that copies input fields to output fields with the same name.
- Use Wildcard (.*) Rule: generates one rule that copies input fields to output fields with the same name.
1) If it is not already displayed, display the Transform Editor Grid.
2) Click the Business Rules tab if it is not already displayed.
3) Select Edit > Add Default Rules.
In the case of Reformat, if the destination field names are the same as, or a subset of, the source field names, there is no need to write anything in the reformat xfr, provided you do not need any real transformation other than reducing the set of fields or splitting the flow into a number of flows.
Partition by Key (hash partition): this partitioning technique is used to partition data when the keys are diverse. If a single key value occurs in large volume, there can be large data skew. Even so, this method is the one used most often for parallel data processing.
Round-robin partition is another partitioning technique, which distributes the data uniformly across the destination partitions. The skew is zero when the number of records is divisible by the number of partitions. A real-life example is how a pack of 52 cards is dealt among 4 players in a round-robin manner.
There are many ways the performance of the graph can be improved.
1) Use a limited number of components in a particular phase
2) Use optimum value of max core values for sort and join components
3) Minimise the number of sort components
4) Minimise sorted join components and, if possible, replace them with in-memory (hash) joins
5) Use only the required fields in sort, reformat and join components
6) Use phasing or flow buffers with merges and sorted joins
7) If both inputs are huge, use a sorted join; otherwise use a hash join with the proper driving port
8) For large datasets, do not use Broadcast as a partitioner
9) Minimise the use of regular expression functions like re_index in transform functions
10) Avoid repartitioning of data unnecessarily
Try to keep the graph running in MFS for as long as possible. For this, the input files should be partitioned and, if possible, the output files as well.
From Ab Initio, either use the Run SQL component with the DDL statement "truncate table", or use the Truncate Table component.
EME stands for Enterprise Metadata Environment, GDE for Graphical Development Environment, and the Co>Operating System can be described as the Ab Initio server. The relationship between the Co>Operating System, EME and GDE is as follows.
The Co>Operating System is the Ab Initio server. It is installed on a particular OS platform, which is called the native OS. The EME is just like the repository in Informatica: it holds the metadata, transformations, database configuration files, and source and target information. The GDE is the end-user environment where we develop graphs (like mappings in Informatica). The designer uses the GDE to design graphs and saves them to the EME or to a sandbox. The sandbox is on the user side, whereas the EME is on the server side.