close

Вход

Log in using OpenID

How To Scrape SMF VBulletin with Scrapebox and Import Into

embedDownload
How To Scrape SMF & VBulletin with Scrapebox and Import Into Nuclear
Link Blaster?
Basically you can use any tool to scrape links into the system, and scrapebox is the best tool for the job.
In this example, I will use scrapebox to show you how to scrape links and import into Nuclear Link
Blaster, certainly you can do that with other scraping software, but I highly recommended scrapebox.
Let's get started.
Before we go further, you need to signup with a proxy service, we recommended yourprivateproxy.com.
You need a lot private proxy to perform massive scraping operation. After setting up the proxies, we will
move on to "custom footprint".
This is the scrapebox interface, select the "Custom Footprint" .
Next you need to find the footprint to scrape those links, how do you do that? This is a big topic,
basically you find some unique words on the page of that site to find those links!
Here are some proven footprint for your reference, I'm hesitate to release this, but I've decided to make
it available to all my customers, so don't ignore these samples, they are highly effective:
"Powered by SMF" inurl:index.php?action=profile intext:Signature
inurl:index.php?action=profile intext:Signature "Powered by SMF"
"Powered by SMF" intext:Signature inurl:action=profile
inurl:action=profile intext:Signature "by SMF"
These 4 works extremely well with Google, however, because of the recent spam query that Google
faced, the Big G has decided to stop the return of "inurl" and "Powered by", so we alter the signature to
allow it run on other search engines, plus avoid Google filtering:
"by SMF" "Signature" "?action=profile"
Although this signature is not as accurate as the previous one, but it did successfully run on all search
engines and able to return massive result! Let's see this on scrapebox.
Scraping the links using "by SMF" "Signature" "?action=profile"
Next, remove duplicated sites. Please note that each search engine will return a maximum of 1,000 sites,
this will greatly reduce your ability to find more sites, one of the tricks to counter this problem is adding
"keywords".
I usually find keywords of Movies, Sports, Politics, Games, Drugs… anything you name it! You can
discover what type of sites are popular when you check existing scraped links. Go visit all the links
you've scrape in this first run to get ideas, and expand your keywords from there!
Scrapebox had include a Google Wonder Wheel scraping tool, this will allows you to generate tons of
keywords from a popular keyword, for instance weight loss.
Scrapebox will use the keywords it scraped and send it to the search engine to scrape 2nd Level of sites,
you can scrape up to 5 levels! This will generate tons of keywords for you to get started.
In this example I didn't scrape much keywords, but in actual live situation, I will scrape few hundred of
thousands keywords to generate millions of result!
Please note that scrapebox will not handle more than 1 million result in the memory, so it will save one
file in its' directory for every million result returned. You have to filter and remove duplicated domains
on each file manually, then add them all into scrapebox for final consolidation.
The files are saved in the folder:
c:\Scrapebox\Harvester_Sessions\
Please consult scrapebox for more information.
Next, let's continue to scrape more sites…
I have scraped 6800+ sites in this example, next, let's check the domain page rank now.
Next, let's export the URLs into an Excel file for further processing.
Let's open the CSV file in Excel and remove PR 0 and N/A sites:
After removing all useless sites, you still get about 3,700+ sites, we have to filter the list further to
remove non SMF sites. Because we are using the general footprint without "Inurl", you have to filter
only sites that match our criteria, in this case, I want to filter sites with correct SMF url structure:
http://uctm.edu/smf/index.php?action=profile;u=9
In the exmpale above, SMF profile must contain "action=profile", so our filter will be:
Next, we enter the filter string.
Now you will see only sites with qualified url structure. Next, we need to copy this into a new excel file
for further process!
Highlight the filtered data and copy it. Then open a new file and paste them into it.
Next, we need to prepare all sites for importing, use this formula:
=LEFT(A1,FIND("index.php", A1)-1), put this text into column "D".
Next, we will need to copy that URL for all other URLs, move your mouse to that column, position it to
the small dot at bottom right conner of the selected box.
Copy the newly generated URLs into a text file! Save the text file, let's call it NewSMFList.txt (You can
use Windows Notepad or a freeware call Notepad++, Google it to download the software).
Next, launch the "Format URL" tool to generate the proper import file.
Now, let's open the GenSMFList.txt and see if they are generated correctly. Here are some of the sites:
http://uctm.edu/smf/index.php?action=register http://uctm.edu/smf/index.php?action=login
http://uctm.edu/smf/index.php?action=login
http://www.atlas-sys.com/community/index.php?action=register http://www.atlassys.com/community/index.php?action=login http://www.atlassys.com/community/index.php?action=login
The correct format is:
RegisterURL<Space>LoginURL<Space>LoginURL<Enter&Return>
RegisterURL<Space>LoginURL<Space>LoginURL<Enter&Return>
Next, we will launch Nuclear Link Blaster to Add the Newly Generated File as Group B. Before import,
you need to create a directory to place that file, the directory name will be use as Packet Name.
Let me explain a bit about the profile sites management structure of NLB.
NLB group all sites into
1. Group A ~ J, a total of 10 Big Groups
2. Each Group can contains unlimited Packets
3. Each Packet can contains Unlimited "Small Packet"
4. Each "Small Packet" will contain 10 sites.
You can load the newly scraped links into Group A as well (The default Group of 2,000 sites), your
directory name must be different from the original directory name, in this case:
See next screen, that's the set task screen for you to assign Group, Packet and Small packet to each
profile you want to submit.
Please note that GenSMFList.txt is now ready to import into NLB to test those sites, however, to save
money and time, we have developed another tool to further filter the custom question & anti bot
question, the tool is called "CategorizeURLList". If you use "CategorizeURLList" you don't need to run
the FormatURL tool. Currently CategorizeURLList Only support "SMF" & VBulletin! Please refer to
CategorizeURL chapter for more information.
OK, enough for the explanation, let's create a folder call TempTest, put the GenSMFList.txt into the
folder and load it into NLB.
Next, we create a new profile and set up the task to run the script. Please refer to nuclear link blaster
usage guide.pdf to learn about How to Setup and Submit profile.
After running the profile, you can generate a report with all successful links, use NLB export link feature
to compare the Group B with the success report, and export only sites that are running successfully!
The first popup screen is for you to find the successful link report, locate the file and import it to
generate an export file of successful sites!
Next step, we remove the Group B sites and import the newly saved SuccessSMFList.txt into Group B,
this will replace the failed sites with the newly done and organized sites!
Some Footprint References:
SMF
"by SMF" "Signature" "?action=profile"
VBulletin
"by vBulletin" http "Signature" "member.php?" -showthread -usercp.php -faq.php -viagra -sex -pills -FDA
-adult -valium
IPBoard
" By IP.Board" "Signature" "index.php" "showuser"
"By IP.Board" "index.php" "User" "About Me"
" IP.Board" "index.php" "showuser" -"has not added a signature" -penis -adult -poker -pills -pussy -naked
-vigra -viagra "Signature" "My Content"
" By IP Board" "Signature" "index.php" "showuser"
"By IP Board" "index.php" "User" "About Me"
" IP Board" "index.php" "showuser" -"has not added a signature" -penis -adult -poker -pills -pussy -naked
-vigra -viagra "Signature" "My Content"
Expression Engine
" by ExpressionEngine" "Communications" "Bio" "member" "Script Executed in"
Document
Category
Entertainment and Humor
Views
2 224
File Size
1 473 Кб
Tags
1/--pages
Пожаловаться на содержимое документа