replace large amounts of data

DoDaDew

New Member
This task requires me to find certain product codes in the database and replace them with something else. There are about 1,000 product codes and about 5 million related records to work on. This is what I have so far.

def var lv-prim-index as char no-undo.

for each _field
    where _field._field-name matches "prod_cd*"
    no-lock:

    find first _file
        where recid(_file) = _field._file-recid
        no-lock no-error.
    if available _file then
    do:
        /* look up the name of the table's primary index */
        find first _index
            where recid(_index) = _file._prime-index
            no-lock no-error.
        assign lv-prim-index = if available _index
                               then _index._index-name
                               else "".
        run clean_prod_I.p
            (input _file._file-name,
             input _field._field-name,
             input "newCode",
             input "oldCode",
             input lv-prim-index).
    end.
end.

and clean_prod_I.p currently just does a simple display.
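
For what it's worth, here is a rough sketch of what the actual replace in clean_prod_I.p could look like, using a dynamic buffer and query (the ip-* parameter names are placeholders, and the index name is passed in but not used in this sketch):

Code:
/* clean_prod_I.p -- rough sketch, not the real program */
define input parameter ip-file  as char no-undo. /* table name        */
define input parameter ip-field as char no-undo. /* field to update   */
define input parameter ip-new   as char no-undo. /* replacement value */
define input parameter ip-old   as char no-undo. /* value to replace  */
define input parameter ip-index as char no-undo. /* index name, unused here */

define variable hBuf as handle no-undo.
define variable hQry as handle no-undo.

create buffer hBuf for table ip-file.
create query hQry.
hQry:set-buffers(hBuf).
hQry:query-prepare(substitute("for each &1 where &1.&2 = &3",
                              ip-file, ip-field, quoter(ip-old))).
hQry:query-open().

repeat:
    hQry:get-next(exclusive-lock).
    if hQry:query-off-end then leave.
    /* each REPEAT iteration is its own transaction */
    hBuf:buffer-field(ip-field):buffer-value = ip-new.
end.

hQry:query-close().
delete object hQry.
delete object hBuf.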

My problem is speed. For a single test product code it took about 2 hours. I need something much better than what I have.

thanks for the help!
 

vinod_home

Member
Hi,

It is always better to analyse this as a two-part process. First, get the list of tables that have the field you're looking for. Then find which of those tables have a lot of records in them.

For example, you could have 10 tables with the field "prod_cd", but only one history table with a million rows; the other nine tables might each have only a few hundred records.
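
If you have access to the database utilities, proutil's tabanalys report is one way to get record counts per table without writing any code (the database name below is just an example):

Code:
proutil mydb -C tabanalys > tabanalys.out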

Comment out the call to clean_prod_I and time your query. The lookup on its own should run in a few seconds.

I still don't know exactly what you're doing in the clean_prod_i program.
Hope that helps,
 

DoDaDew

New Member
Thanks vinod, I think that is what I am doing. I broke the records down: instead of going through all of them, I found just the records related to my changes and work on those using as many index fields as possible. It still takes about 30-60 minutes, but it's a one-time thing.
 

Mike Smith

New Member
Hi,


Looking at the code in your first post, you find the primary index for each file, but that doesn't necessarily mean the index has anything to do with your prod_code. As a result, your later queries may be using an index that is unrelated to what you want.

For example, let's say you have a table called Parts and Prod_code is a field in that table. Also let's say Key1 is the primary index on Parts and is composed of the field PartNo. When you run your query in clean_prod_i.p you will be using an index built for PartNo while trying to find records by prod_code, so you will have to look at every record in the table -- very slow for large tables.

Instead, try to find an index that uses prod_code (not necessarily the primary index) for each table.
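
One way to check is to read the metaschema and list which indexes on a table include the field; a quick sketch, using my example table and field names:

Code:
for each _file
    where _file._file-name = "Parts" no-lock,
    each _index
    where _index._file-recid = recid(_file) no-lock,
    each _index-field
    where _index-field._index-recid = recid(_index) no-lock,
    first _field
    where recid(_field) = _index-field._field-recid
      and _field._field-name = "Prod_Code" no-lock:
    display _file._file-name
            _index._index-name
            _index-field._index-seq. /* field's position within the index */
end.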

As vinod said, you may need to break this into two parts. First determine which tables are involved and which indexes are available for you to use. Then you may need to create separate clean programs for some tables to make optimal use of an index.

Continuing from my example above, let's say that Parts has another index called Key2, composed of the fields PartNo and Prod_Code. Because PartNo is the first field, the index is bracketed on it first. If you try to use Prod_Code alone you will still end up looking at all records. What you need to do is customize the query for that table to something like:

Code:
for each Parts
    where Parts.Partno <> ""
      and Parts.Prod_code = vNewProdCode
    use-index key2:

This causes Progress to use the Key2 index: it looks first for all Parts where PartNo is not blank (there shouldn't be any if it is the first field in an index), which gets us past the first bracket and able to bracket on the second field, Prod_Code.

Hope that helps some.
 

bendaluz2

Member
An inequality match in a where clause will cause Progress to do a whole-index scan (regardless of whether the USE-INDEX phrase is used).

for each Parts where Parts.Partno <> "" and Parts.Prod_code = vNewProdCode use-index key2

should be expressed as a range match to use an index bracket (note the parentheses: in a where clause, AND binds more tightly than OR)

Code:
for each Parts
    where (Parts.Partno < ""
        or Parts.Partno > "")
      and Parts.Prod_code = vNewProdCode
    no-lock
    use-index key2:

To confirm this, produce yourself an XREF listing of the following program, compiled against the Sports2000 database:

Code:
FOR EACH customer
    WHERE customer.cust-num <> 70
    NO-LOCK:
END.

FOR EACH customer
    WHERE customer.cust-num <> 70
    NO-LOCK USE-INDEX cust-num:
END.

FOR EACH customer
    WHERE customer.cust-num < 70
       OR customer.cust-num > 70
    NO-LOCK:
END.
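
For instance, assuming the program above is saved as ix-test.p, the listing can be generated with:

Code:
COMPILE ix-test.p XREF ix-test.xrf.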

which produces the following:

Code:
.\ix-test.p .\ix-test.p 1 COMPILE ix-test.p
.\ix-test.p .\ix-test.p 1 CPINTERNAL ISO8859-1
.\ix-test.p .\ix-test.p 1 CPSTREAM ISO8859-1
.\ix-test.p .\ix-test.p 1 STRING "Customer" 8 NONE UNTRANSLATABLE 
.\ix-test.p .\ix-test.p 1 ACCESS Sports2000.Customer Cust-Num 
.\ix-test.p .\ix-test.p 1 SEARCH Sports2000.Customer Cust-Num WHOLE-INDEX
.\ix-test.p .\ix-test.p 6 ACCESS Sports2000.Customer Cust-Num 
.\ix-test.p .\ix-test.p 6 SEARCH Sports2000.Customer Cust-Num WHOLE-INDEX
.\ix-test.p .\ix-test.p 11 ACCESS Sports2000.Customer Cust-Num 
.\ix-test.p .\ix-test.p 11 ACCESS Sports2000.Customer Cust-Num 
.\ix-test.p .\ix-test.p 11 SEARCH Sports2000.Customer Cust-Num
.\ix-test.p .\ix-test.p 11 SEARCH Sports2000.Customer Cust-Num
.\ix-test.p .\ix-test.p 15 STRING "Cust-Num" 8 NONE UNTRANSLATABLE

As you can see, the first two searches do a whole-index scan and the third uses the index cust-num
 

Mike Smith

New Member
Thanks for the information, but I'm not quite sure I understand what you mean by a whole-index scan.

When I run a test against my system on a table with 4.5 million qualifying records, I only see a 7-second difference between an inequality match and a range match (99 seconds vs. 92 seconds). I had to use a query that resulted in an unusually high number of qualifying records just to get any significant (> 3 seconds) difference at all.
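
A simple ETIME harness, sketched here against Sports2000 rather than my actual table, is one way to run this kind of timing comparison:

Code:
define variable i      as integer no-undo.
define variable iStart as integer no-undo.

iStart = etime(yes). /* reset the millisecond timer; returns 0 */
for each customer
    where customer.cust-num <> 70
    no-lock:
    i = i + 1.
end.
display "inequality match:" etime "ms," i "records".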

In fact, when running the test with a query that had 250,000 qualifying records, the inequality test was two seconds faster than the range test (48 seconds vs. 50 seconds).

While the range test may not be doing a whole-index scan, it appears from the xref listing that you posted that it is doing two scans: the first for all records where cust-num < 70 and the second for all records where cust-num > 70. Unless I'm missing something, that effectively equates to a whole-index scan as well.

To further test my idea, I changed the queries from an inequality test and its equivalent range scan to an equality test and its equivalent range scan ("<>" becomes "=" and "or" becomes "and").
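
Concretely, the changed queries look like this (again against Sports2000):

Code:
FOR EACH customer
    WHERE customer.cust-num = 70
    NO-LOCK:
END.

FOR EACH customer
    WHERE customer.cust-num >= 70
      AND customer.cust-num <= 70
    NO-LOCK:
END.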

The results here were 6,500 qualifying records, with the equality test taking less than 1 second to finish and the equivalent range scan taking 12 seconds.

From these results I would surmise that:

an equality test on an index is the most efficient
an inequality test is less efficient, but still better than a complete table scan
a range scan that is equivalent to an inequality test performs essentially the same

If I am misunderstanding something, please let me know. I do occasionally have to use inequality tests, and if there is a better (faster) way to write the queries I would love to know.
 